Research On Compression And Indexing Methods For High-throughput Sequencing Data

Posted on:2020-07-30

Degree:Master

Type:Thesis

Country:China

Candidate:C H Wang

Full Text:PDF

GTID:2370330602451881

Subject:Engineering

Abstract/Summary:

PDF Full Text Request

In the past ten years,Gene sequencing technology has developed rapidly and the cost of sequencing and sequencing cycles have dropped dramatically.Massive sequencing data not only puts higher requirements on network transmission and local storage,but also brings inconvenience to the analysis and use of sequencing data.The FASTQ file format is one of the storage formats used by the most mainstream sequencing technologies.The current mainstream compression index algorithm cannot solve the compression and indexing problems of FASTQ files.Therefore,it is necessary to design a Compressed and indexing algorithm for FASTQ files to solve the problem for storing and retrieving sequencing data.This thesis proposes a compression indexing algorithm EFASTQ for FASTQ files.The algorithm is based on the characteristics of FASTQ files,pre-processing the files,and then compressing and storing them.Based on the compressed files,the index structure is constructed and search in FASTQ is implemented.The EFASTQ algorithm first classifies and extracts the read sequence,the quality score sequence,and the identifier sequence contained in the FASTQ file.For the compression and indexing requirements of read sequences,EFASTQ uses the compression index algorithm FM-EF proposed in this thesis to compress read sequences,and implements read count algorithm,read positioning algorithm and read extraction algorithm.The time complexity of the read counting algorithm is O?plog₂|?_R|?,where p represents the read length to be counted,|?_R|represents the read data symbol table size;the read positioning algorithm has a time complexity of O?occ_R??log₂|?_R|?log n_R?²/log₂log₂ n_R+1?),where occ_R is the number of locations to be located,n_R is the size of the read data;the algorithm complexity of the read extraction algorithm is O?log|?_R|??log₂ n_R?²/loglog₂ n_R+len_R??,where len_R represents the length of the extraction.For the feature that the identifier sequence is composed of multiple different fields,the EFASTQ algorithm classifies the fields in the identifier sequence,encodes different types of fields in different ways,and uses the compressed index proposed in this thesis.The algorithm FM-EG compresses and implements the mass fraction extraction algorithm.The time complexity of the algorithm is O??len_Q+?log₂n_Q?²/log₂log₂n_Q?log|?_Q|+len_Q?.Aiming at the characteristics of the string consisting of consecutive identical characters in the sequence of quality scores,the run length coding is used to preprocess the quality score sequence,the encoded result is compressed by FM-EG algorithm,and the identifier extraction algorithm is implemented.The time complexity of the algorithm is O??len_I+?log₂n_I?²/log₂log₂n_I?�log₂|?_I|?.Based on the above algorithm,the FASTQ extraction algorithm is implemented.The time complexity of the algorithm is O((len_FQ+(log₂n_FQ)²/log₂log₂ n_FQ)�log₂|?_FQ|+len_FQ),where len_FQ is the extracted FASTQ data set.The length,n_FQ is the FASTQ file size after preprocessing,and|?_FQ|indicates the character set size of the preprocessed FASTQ file.In addition,when optimizing the retrieval algorithm in the EFASTQ algorithm,it is found that the suffix array sampling strategy adopted by the index has a great influence on the positioning algorithm.In response to this situation,the text uses a value sampling strategy to sample the suffix array and design the corresponding structure and algorithm.The experimental content of this thesis is divided into two parts.The first part evaluates the compression retrieval performance of the EFASTQ algorithm.In terms of retrieval performance,the EFASTQ algorithm and the FASTQ file compression retrieval algorithm BEETL-FASTQ algorithm were tested for retrieval performance.The experimental results show that EFASTQ has higher efficiency in terms of retrieval performance,especially the EFASTQ short-reading algorithm is about 10 times faster than the BEETL-FASTQ algorithm.In terms of compression performance,the compression performance experiment was carried out by the classical text compression algorithm Gzip,the industry-leading FASTQ file compression algorithm DSRC2,the FASTQ file compression retrieval algorithm BEETL-FASTQ,and the FASTQ file quality score compression algorithm AQUa.The experimental results show that the EFASTQ algorithm is close to the DSRC2 algorithm and the AQUa algorithm in terms of compression ratio,and has a great advantage for the Gzip and BEETL algorithms.The second part of the experiment evaluated the performance of the positioning algorithm for two different suffix array sampling strategies.The performance of the positioning algorithm was compared by two sampling suffix array sampling strategies through design experiments.The experimental results show that the value sampling strategy can improve the retrieval performance by 15%�20%compared with the position sampling strategy.

Keywords/Search Tags:

FASTQ format, compression, index, sampling strategy

PDF Full Text Request

Related items

1	Research On Lossless Compression Algorithms For FASTQ Files
2	The Research Of Reference-based Compression Specified For Sequence Data
3	Lossless Comprssion Of High-throughput DNA Sequence Data
4	Optimization And Implementation Of Lossless Compression Of Gene Sequencing Data
5	The Research Of Multi-format Seismic Data Access And Conversion Technology
6	Data-Compression For DNA Sequences Using M-index
7	Studies On Structured Storage And Compression Of Mass Spectrometry Data
8	Design And Implementation Of Hardware Based On Compression Sampling Algorithm For Radio Astronomy Signal
9	A Unified Format And Its Editor For Exchange And Presentation Deep Sequencing Data
10	Study On Acceptance Sampling Plan Based On Different Quality Characteristics Index