| The rapid development of sequencing technology has reduced the cost of sequencing and increased the accuracy of its results, advancing bioinformatics across many research directions. The explosive growth of sequencing data has also created challenges for data storage and transmission, and lossless compression is now widely used to address them. In addition, to meet the needs of downstream gene data analysis, the reads obtained from sequencing must be assembled into longer gene sequences to form more complete genome data. This article focuses on two problems related to gene sequences: lossless compression and assembly. For FASTQ files, a reference-free lossless compression algorithm, RFGSC, is proposed to reduce the large storage space that sequencing data requires. A de novo sequence assembly algorithm based on a hybrid strategy, ODO, is developed to address the poor global performance, weak specificity, and complex path graphs of single-strategy sequence assembly. The ODO algorithm integrates the advantages of multiple strategies, improving the accuracy and reliability of assembly. The main work is as follows: (1) RFGSC is a reference-free lossless compression algorithm consisting of three main steps. First, the input FASTQ file is split into three subsequence files according to sequence type. Then, each file is compressed with a technique suited to its content. Finally, a general-purpose compressor is applied for secondary compression. To verify the compression performance of RFGSC, comparative experiments were conducted against other compression algorithms, and the results showed that RFGSC achieved good results in both compression rate and compression time. (2) ODO is a de novo sequence assembly algorithm based on a hybrid strategy that requires no reference genome. Three basic strategies 
for de novo sequence assembly are analyzed in this article, and the construction of a hybrid-strategy sequence assembly algorithm is studied. Using formal methods, domain analysis modeling, and production programming methods, two algorithms based on the overlap-layout-consensus (OLC) strategy and one algorithm based on the de Bruijn graph (DBG) strategy were constructed and then combined into an (OLC+DBG)→OLC hybrid-mode algorithm (referred to as the ODO algorithm). Finally, the assembly results of the single-strategy algorithms and the constructed ODO algorithm were compared in terms of N50, number of contigs, and coverage, and the impact of coverage depth and k-value on the assembly results was analyzed. The results showed that, compared with other sequence assembly algorithms, the ODO algorithm had some advantages in the N50 and coverage metrics. The lossless compression algorithm RFGSC proposed in this article alleviates the problems of large storage space and long transmission time, and the sequence assembly algorithm ODO ensures the reliability of the results when assembling short sequences into longer gene sequences, providing a good data basis for downstream gene analysis. The successful implementation and good performance of the algorithms provide a reference for subsequent related research. |
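
Two of the evaluation dimensions named above, N50 and the number of contigs, can be illustrated with a short sketch. It uses the standard definition of N50: the contig length at which, going from the longest contig down, the running total first reaches half of the total assembled length. The function names `n50` and `assembly_summary` are hypothetical and chosen only for illustration. This is not the authors' evaluation code, and coverage is omitted because it needs a comparison sequence that the abstract does not describe.

```python
from typing import Dict, Iterable, List


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the total length.
    """
    ordered: List[int] = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for a non-empty list of positive lengths")


def assembly_summary(contigs: Iterable[str]) -> Dict[str, int]:
    """Summarise an assembly (hypothetical helper): contig count, total length, N50."""
    lengths = [len(c) for c in contigs]
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Toy contigs, invented for illustration only.
    print(assembly_summary(["ACGTACGT", "ACGT", "AC"]))
```

A larger N50 together with a smaller contig count generally indicates a more contiguous assembly, which is why the two are usually reported side by side.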