Font Size: a A A

Eulerian Path-Based Parallel DNA Sequence Assembly

Posted on:2011-11-05Degree:MasterType:Thesis
Country:ChinaCandidate:D C ChenFull Text:PDF
GTID:2120330338979953Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
DNA assembly plays an important role in DNA sequencing. From the invention of Sanger sequencing technology in 1977, to the invention of the next-generation sequencing technology in 2005, DNA sequencing mainly uses Sanger sequencing technology. The length of DNA fragments sequenced by Sanger sequencing technology can up to 1000bp, and these fragments's accuracy can be up to 99.999%. The DNA fragments obtained by the Sanger sequencing technology usually take overlap-layout-consensus algorithm as assembly algorithm.Compared with the first generation of sequencing technology, the next-generation DNA sequencing technology generate shorter fragments, these fragments have higher error rate and have quite higher throughput in the same time. According to the characteristics of the next-generation sequencing technology, currently there are three assembly strategies to assembly DNA fragments produced by the next generation sequencing technology: Greedy algorithm, Overlap-Layout-Consensus algorithm, and the de Bruijn graph-based Eulerian Path algorithm. The first two algorithm need to calculate the overlap of all DNA fragments, so their time complexity is high. De Bruijn graph-based Eulerian path algorithm converts DNA assembly problem to Eulerian Path problem by splitting read into k-mer, which has linear time algorithm.In this paper we adopt Eulerian Path algorithm as DNA fragments assembly algorithm. The throughput of the next-generation sequencing technology is very high. The next-generation sequencing technology can generate read data on the number of Gigabyte in one run. The difficulty is the space the de Bruijn graph-based Eulerian path algorithm costs is too large. This paper describes a de Bruijn graph based parallel assembly algorithm, it stores the k-mer produced by splitting all reads to hash table lay on multiple processes and coding all k-mer to reduce memory consumption. The assembly process will execute in parallel, and during assembly process, the various assembly processes share data by sending and receive package. Experiments show that this parallel assembly algorithm has nearly linear time complexity and space complexity, so it has good scalability to address large-scale genome fragment assembly problem.
Keywords/Search Tags:Eulerian path, de Bruijn graph, parallel assembly, hash mapping
PDF Full Text Request
Related items