An Alignment Algorithm For DNA Short Reads Based On The Hamming Distance

Posted on:2014-03-08

Degree:Master

Type:Thesis

Country:China

Candidate:X S Yang

Full Text:PDF

GTID:2250330422451938

Subject:Computer technology

Abstract/Summary:

New sequencing technologies have greatly advanced life science research both at home and abroad. In most life science studies, the first step is to map next-generation sequencing data to a reference genome. Because the reads produced by next-generation sequencing are short and very numerous, traditional alignment algorithms are no longer suitable for aligning them. For large-scale sequencing data, this paper designs a fast and efficient algorithm that aligns short reads within a limited Hamming distance.

First, the paper briefly describes how two hash index tables are built, one from the reference genome and one from the sequencing data. Each entry of each table corresponds to a data block. When two blocks are very large, the number of comparisons between them is also very large, reaching several hundred billion. To avoid unnecessary comparisons and reduce the number of comparisons between two blocks, the paper proposes first sorting the blocks and then comparing them within the limited Hamming distance. The short-read alignment algorithm is designed and implemented on the basis of this strategy.

Next, the paper describes in detail the strategy for aligning two large blocks within the limited Hamming distance. It then selects a suitable sorting algorithm for the data blocks from several basic sorting algorithms, based on experimental analysis with inputs of different sizes.

The paper then uses two methods to compare two sorted large blocks within the limited Hamming distance. In the first method, for each item in the smaller block, nucleotides in the item's sort-key sequence are substituted within the Hamming distance limit to generate combinations, and the ordered combinations are then searched for in the larger block. In the second method, the same substitutions are applied to the sort-key sequences of all items in the smaller block, and all the resulting combinations are sorted to form a new data block, which is then aligned against the other large block in a linear downward pass. When both blocks are very large, the paper uses the first method; when the two blocks are relatively large, it uses the second. The paper presents the design of the algorithm and analyzes the time and space complexity of both methods in detail.

Finally, the paper evaluates the performance of the algorithm. Compared with other mapping algorithms, it shows a significant advantage in both speed and accuracy.
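
The two comparison methods described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`hamming`, `neighbors`, `match_by_search`, `match_by_merge`), the representation of a block as a list of equal-length strings over `ACGT`, and the use of binary search for the lookup step are all assumptions made for illustration.

```python
from bisect import bisect_left
from itertools import combinations, product

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def neighbors(seq, max_dist):
    """Yield each sequence within max_dist substitutions of seq exactly once."""
    yield seq  # distance 0
    for d in range(1, max_dist + 1):
        for positions in combinations(range(len(seq)), d):
            # At every chosen position, try each base that differs from the original.
            choices = [[b for b in ALPHABET if b != seq[p]] for p in positions]
            for subs in product(*choices):
                chars = list(seq)
                for p, b in zip(positions, subs):
                    chars[p] = b
                yield "".join(chars)


def match_by_search(small_block, large_sorted, max_dist):
    """First method (sketch): look up each variant in the sorted larger block."""
    hits = set()
    for read in small_block:
        for cand in neighbors(read, max_dist):
            i = bisect_left(large_sorted, cand)  # binary search (assumed lookup)
            if i < len(large_sorted) and large_sorted[i] == cand:
                hits.add((read, cand))
    return sorted(hits)


def match_by_merge(small_block, large_sorted, max_dist):
    """Second method (sketch): sort all variants, then scan both blocks linearly."""
    variants = sorted(
        (cand, read)
        for read in small_block
        for cand in neighbors(read, max_dist)
    )
    hits = set()
    i = j = 0
    while i < len(variants) and j < len(large_sorted):
        cand, read = variants[i]
        ref = large_sorted[j]
        if cand < ref:
            i += 1
        elif cand > ref:
            j += 1
        else:
            hits.add((read, ref))
            i += 1  # keep j: later variants may equal the same reference entry
    return sorted(hits)


if __name__ == "__main__":
    reads = ["ACGT", "TTTT"]
    reference = sorted(["ACGA", "ACGT", "GGGG", "TTTA", "AATT"])
    print(match_by_search(reads, reference, 1))
    print(match_by_merge(reads, reference, 1))
```

In this sketch, the first function performs one binary search per generated variant, while the second pays for sorting all variants up front and then needs only a single pass over both blocks. Both return the same (read, reference sequence) pairs.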

Keywords/Search Tags:

next-generation sequencing data, short-read alignment, Hamming distance

Related items

1	The Study On Read Alignment Algorithm For High-throughput Sequencing Datasets
2	Researches On Long Read Alignment Algorithms Oriented To The Third Generation Sequencing Technology
3	Research On Calling Methods Of Structural Variation Based On Third Generation Sequencing Data
4	Optimizing High-throughput Biological Gene Sequencing Data Processing Algorithms Based On Hash
5	Research On Sequence Alignment Methods For The Third-generation Sequencing Data
6	Algorithms Of Aligning The Third-Generation Sequencing Sequences And Picking The Operational Taxonomic Units
7	Researches Of Short Sequence Alignment And Scaffold Algorithm Based On Next Generation Sequencing
8	Design And Optimization Of High-Performance Algorithms For Processing Biological Sequence Data
9	Optimization And Implementation Of Lossless Compression Of Gene Sequencing Data
10	A Sequence Alignment Algorithm With Combining Variants Data