Font Size: a A A

An Alignment Algorithm For DNA Short Reads Based On The Hamming Distance

Posted on:2014-03-08Degree:MasterType:Thesis
Country:ChinaCandidate:X S YangFull Text:PDF
GTID:2250330422451938Subject:Computer technology
Abstract/Summary:PDF Full Text Request
New sequencing technologies greatly promote the domestic and foreignscholars on the life science research. In most of the life science studies, thenew-generation sequencing data has been mapped to a reference genome as a firststep. As reads from the new-generation sequencing data are short and large-scale,the traditional alignment algorithms are no longer used in the sequencing dataalignment. For the large-scale sequencing data alignment, this paper designs a fastand efficient alignment algorithm for short reads within a limited hamming distance.Firstly, this paper briefly describes the process of building two hash indextables from a reference genome and the sequencing data respectively. Every item ofeach table corresponds to a data block. When the two blocks are very large, thenumber of comparisons between them is large, even to several hundred billion times.To avoid unnecessary comparisons and reduce the number of comparisons betweenthe two blocks, this paper presents a strategy that is to sort them blocks firstly, andthen compare between them in the limited hamming distance. This alignmentalgorithm for short reads is designed and implemented based on the strategy.Then this paper describes the strategy of alignment between two large blockswithin the limited hamming distance in detail. Then this paper chooses a suitablesorting algorithm for data blocks from several basic sorting algorithms byexperimental analysis under different size of input data.Then this paper uses two ways to compare two sorted large blocks within thelimited hamming distance. One way is that for an item from the smaller block, thesorting sequence in the item replaces nucleotides within the hamming distance togenerate combinations and then search for the ordered combinations in the largerblock. The other way is that for all items from the smaller blocks, the sortingsequences in them replace nucleotides within the hamming distance to generatecombinations and sort all the combinations to form a new data block. Then the newblock aligns another large block linearly downward. In the case when both blocksare very large, the paper uses the first matching method and when two data blocksare relatively large, the paper uses the second way. This paper designs the algorithmand analyzes time and space complexity of the two ways in detail.Finally, the paper evaluates the performance of the algorithm. In comparison toother mapping algorithms, the algorithm has a significant advantage in the speedand accuracy.
Keywords/Search Tags:next-generation sequencing data, short-read alignment, the hamming distance
PDF Full Text Request
Related items