Data-Compression For DNA Sequences Using M-index

Posted on:2018-07-15

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Li

Full Text:PDF

GTID:2310330521450907

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

DNA as a carrier for long-term storage of biological genetic information and the genetic data it recorded is of great research value.In recent years,with the rapid development of next-generation sequencing(NGS)technology,massive DNA data is emerging,and the application of DNA information is becoming more and more extensive.How to efficiently store the rapidly growing DNA Data and make the effective random access and pattern matching on it,has become one of the important issue of modern bio-information.DNA data is a special kind of biological data,it always contains a large amount of information,its character sets are relatively small,and there exist a large number of repetitive or similar fragments in it.Therefore,if people directly apply the general data compression index algorithm to the DNA data,they can hardly get the ideal effect of compression.In order to further improve the data compression rate of DNA data,we need to design a dedicated DNA data compression index algorithm by using the characteristics of DNA data and the efficient algorithm ideas of the general compression index algorithm.In this paper,we firstly designed the ALCS mapping structure between two DNA sequences.This structure can quickly extract the difference information between two similar DNA sequences,and effectively avoid the repeated storage of common information between them by the process of finding approximating the longest common subsequence ALCS.ALCS is a simplified structure proposed on the basis of the longest common subsequence LCS.It uses the local optimal idea to improve the mapping structure effectively and reduce the peak memory required by the process on base on the correct mapping information.Secondly,based on the ALCS mapping structure,this paper designs and implements an efficient compression index algorithm ALCS-FM for DNA sequence set.This algorithm firstly uses FM-index to build an index for the reference sequence and then map other non-reference sequence to the reference sequence by the mapping structure and the tag arrays.By this way,we can transform the complete non-reference sequence information into the smaller different information.So that we can achieve the efficient compression storage for number of similar gene sequences and a variety of query operations within the entire sequence set.Finally,by using the data characteristics of the tag arrays,we designs a hybrid coding structure suitable for highly sparse 0/1 sequences and a binary storage structures for two similar 0/1 sequences.This structure can be used to support a series of queries and can also get effective compression storage on the tag arrays.So it can effectively improve the compression efficiency and query efficiency of the algorithm.The experimental result shows that the ALCS-FM algorithm has a very significant compression effect on multiple DNA sequences with different similarities,and it can support the random access and pattern matching operation for the DNA sequences,but the query efficiency is still slightly inadequate.

Keywords/Search Tags:

LCS, Reference Sequence, FM-index, Hybrid coding, compression index

PDF Full Text Request

Related items

1	High-throughput Genome Resequencing Data Compression Algorithm Based On Self-index Structure
2	The Research Of Reference-based Compression Specified For Sequence Data
3	Research On Fast Migration Algorithm Between Reference Gene Compression Libraries
4	Research On Entropy-Compressed Structure Of Sequence And Text
5	Lossless Comprssion Of High-throughput DNA Sequence Data
6	The Research On DNA Sequence Compression With Transform Coding And Entropy Coding
7	Coding And Non-coding Sequence Analyzing Based On Z-curve Theory
8	The Extremal Values Of Topological Index Of Caterpillars With Given Degree Sequence
9	Compression Of DNA Sequences Based On Reference Sequences And Weighting Of Context Models
10	Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array