Font Size: a A A

Research And Implementation Of Sequence Assembly Parallel Programming On Bi-directed De Bruijn Graph

Posted on:2013-10-22Degree:MasterType:Thesis
Country:ChinaCandidate:J R YuanFull Text:PDF
GTID:2230330374489275Subject:Electronics and Communications Engineering
Abstract/Summary:PDF Full Text Request
DNA sequence assembly is an important research subject in the area of bioinformatics. With the appearance of high throughput short read sequencing technology, the sequence genomes achieve a higher coverage, which brings serious challenges for the original sequence assembler. Efficient genome assembler which can scale to large genomes is the key of processing next DNA sequencing data. Therefore, this thesis concentrates on how to improve the speed of sequence assembly combing with parallel computing.By the analysis of existing assembly methods and relevant technology based on de Bruijn graph, the mathematical model of multi-step bi-directed de Bruijn graph are established in this thesis(bi-directed de Bruijn graph for short). Then the properties of this graph are deduced and demonstrated, and thus sequence assembly algorithm based on bi-directed de Bruijn graph is obtained. The algorithm proposed can get full-extended bi-directed edges through mergeing semi-extended bi-directed edges, namely contigs during DNA assembly.Combined with the MPI parallel programming techniques, sequencing assembler is implemented based on bi-directed de Bruijn graph, including four sections, namely parallel I/O designing, constructing bi-directed de Bruijn subgraph, storing and constructing bi-directed de Bruijn graph and mergeing bi-directed edges. Its computing complexity is O(n/p), communication complexity is O(n/p), the amount of communication of single node is O(n log(n)/p), where n is the number of the reads, and p is the number of CPU.The experimental results demonstrate that parallel sequence assembly technology based on bi-directed de Bruijn graph in this thesis has higher operation speed, lower memory consumption per computing node. The assembler used to assemble C.elegans in this thesis has a factor of20times speedup when the number of processors scales from10to640, and has the ability of handling large-scale graph within a constant amount of memory in a cluster.
Keywords/Search Tags:sequence technology, genome assembly, de Bruijn graph, parallel computing
PDF Full Text Request
Related items