Font Size: a A A

Research On Compression And Indexing Methods For High-throughput Sequencing Data

Posted on:2020-07-30Degree:MasterType:Thesis
Country:ChinaCandidate:C H WangFull Text:PDF
GTID:2370330602451881Subject:Engineering
Abstract/Summary:PDF Full Text Request
In the past ten years,Gene sequencing technology has developed rapidly and the cost of sequencing and sequencing cycles have dropped dramatically.Massive sequencing data not only puts higher requirements on network transmission and local storage,but also brings inconvenience to the analysis and use of sequencing data.The FASTQ file format is one of the storage formats used by the most mainstream sequencing technologies.The current mainstream compression index algorithm cannot solve the compression and indexing problems of FASTQ files.Therefore,it is necessary to design a Compressed and indexing algorithm for FASTQ files to solve the problem for storing and retrieving sequencing data.This thesis proposes a compression indexing algorithm EFASTQ for FASTQ files.The algorithm is based on the characteristics of FASTQ files,pre-processing the files,and then compressing and storing them.Based on the compressed files,the index structure is constructed and search in FASTQ is implemented.The EFASTQ algorithm first classifies and extracts the read sequence,the quality score sequence,and the identifier sequence contained in the FASTQ file.For the compression and indexing requirements of read sequences,EFASTQ uses the compression index algorithm FM-EF proposed in this thesis to compress read sequences,and implements read count algorithm,read positioning algorithm and read extraction algorithm.The time complexity of the read counting algorithm is O?plog2|?R|?,where p represents the read length to be counted,|?R|represents the read data symbol table size;the read positioning algorithm has a time complexity of O?occR??log2|?R|?log nR?2/log2log2 nR+1?),where occR is the number of locations to be located,nR is the size of the read data;the algorithm complexity of the read extraction algorithm is O?log|?R|??log2 nR?2/loglog2 nR+lenR??,where lenR represents the length of the extraction.For the feature that the identifier sequence is composed of multiple different fields,the EFASTQ algorithm classifies the fields in the identifier sequence,encodes different types of fields in different ways,and uses the compressed index proposed in this thesis.The algorithm FM-EG compresses and implements the mass fraction extraction algorithm.The time complexity of the algorithm is O??lenQ+?log2nQ?2/log2log2nQ?log|?Q|+lenQ?.Aiming at the characteristics of the string consisting of consecutive identical characters in the sequence of quality scores,the run length coding is used to preprocess the quality score sequence,the encoded result is compressed by FM-EG algorithm,and the identifier extraction algorithm is implemented.The time complexity of the algorithm is O??lenI+?log2nI?2/log2log2nI?×log2|?I|?.Based on the above algorithm,the FASTQ extraction algorithm is implemented.The time complexity of the algorithm is O((lenFQ+(log2nFQ)2/log2log2 nFQ)×log2|?FQ|+lenFQ),where lenFQ is the extracted FASTQ data set.The length,nFQ is the FASTQ file size after preprocessing,and|?FQ|indicates the character set size of the preprocessed FASTQ file.In addition,when optimizing the retrieval algorithm in the EFASTQ algorithm,it is found that the suffix array sampling strategy adopted by the index has a great influence on the positioning algorithm.In response to this situation,the text uses a value sampling strategy to sample the suffix array and design the corresponding structure and algorithm.The experimental content of this thesis is divided into two parts.The first part evaluates the compression retrieval performance of the EFASTQ algorithm.In terms of retrieval performance,the EFASTQ algorithm and the FASTQ file compression retrieval algorithm BEETL-FASTQ algorithm were tested for retrieval performance.The experimental results show that EFASTQ has higher efficiency in terms of retrieval performance,especially the EFASTQ short-reading algorithm is about 10 times faster than the BEETL-FASTQ algorithm.In terms of compression performance,the compression performance experiment was carried out by the classical text compression algorithm Gzip,the industry-leading FASTQ file compression algorithm DSRC2,the FASTQ file compression retrieval algorithm BEETL-FASTQ,and the FASTQ file quality score compression algorithm AQUa.The experimental results show that the EFASTQ algorithm is close to the DSRC2 algorithm and the AQUa algorithm in terms of compression ratio,and has a great advantage for the Gzip and BEETL algorithms.The second part of the experiment evaluated the performance of the positioning algorithm for two different suffix array sampling strategies.The performance of the positioning algorithm was compared by two sampling suffix array sampling strategies through design experiments.The experimental results show that the value sampling strategy can improve the retrieval performance by 15%—20%compared with the position sampling strategy.
Keywords/Search Tags:FASTQ format, compression, index, sampling strategy
PDF Full Text Request
Related items