Sequence Assembly Algorithms For Next-generation Sequencing Technology Research

Posted on: 2013-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Chen
Full Text: PDF
GTID: 2210330374462878
Subject: Biological Information Science and Technology
Abstract/Summary:
In recent years, next-generation sequencing (NGS) technologies have evolved rapidly. Compared with traditional Sanger sequencing, which is cost-prohibitive, NGS platforms can generate enormous numbers of sequence reads quickly and at a markedly reduced price, making it possible to sequence more genomes. However, the reads generated by NGS platforms have a much higher error rate and are much shorter, which poses a major challenge for sequence assembly. It is therefore important to develop effective data processing methods tailored to the large volume of error-prone reads from NGS platforms, so that genome assembly can bring the promise of whole-genome sequencing to fruition.

To reduce the serious impact of sequencing errors on de Bruijn graph-based assembly approaches, this thesis presents a novel error correction algorithm aimed at improving assembly results. The algorithm builds a suffix array over the string of all reads and their reverse complements and uses it to find overlaps among reads. It then performs multiple sequence alignment on the overlapping reads to correct sequencing errors. Test results indicate that this error correction algorithm can considerably improve the results produced by sequence assemblers.

In addition, the thesis presents a data clustering algorithm that clusters large sets of NGS reads and thereby reduces the memory requirements of sequence assemblers, making it possible for de Bruijn graph-based assemblers to assemble very large genomes. The algorithm is based on spaced seed indexing and uses OpenMP to cluster reads in parallel. Test results indicate that it reduces redundancy in the sequence reads efficiently.
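The overlap-finding step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the `$` separator, the naive sort-based suffix array construction, and the exact-match overlap test are all assumptions made for illustration. A practical tool would use a linear-time suffix array builder and tolerate mismatches before handing overlapping reads to multiple sequence alignment.

```python
from bisect import bisect_left, bisect_right

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_index(reads):
    """Concatenate reads and reverse complements; build a naive suffix array.

    Sequence 2*i is read i, sequence 2*i + 1 is its reverse complement.
    Sorting whole suffixes is quadratic or worse; it is used here only
    to keep the sketch short (an assumption, not the thesis's method).
    """
    seqs = []
    for read in reads:
        seqs.append(read)
        seqs.append(reverse_complement(read))
    starts, pos = [], 0
    for s in seqs:
        starts.append(pos)
        pos += len(s) + 1  # +1 for the "$" separator
    text = "".join(s + "$" for s in seqs)
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return seqs, text, starts, suffix_array


def find_overlaps(reads, min_len):
    """Find exact suffix-prefix overlaps of at least min_len bases.

    Returns tuples (a, b, length): the last `length` bases of sequence a
    equal the first `length` bases of sequence b.
    """
    seqs, text, starts, sa = build_index(reads)
    # Prefixes of sorted suffixes are themselves sorted, so bisect applies.
    keys = [text[i:i + min_len] for i in sa]
    overlaps = []
    for b, seq_b in enumerate(seqs):
        if len(seq_b) < min_len:
            continue
        seed = seq_b[:min_len]
        lo, hi = bisect_left(keys, seed), bisect_right(keys, seed)
        for p in sa[lo:hi]:
            a = bisect_right(starts, p) - 1  # sequence containing position p
            if a // 2 == b // 2:
                continue  # skip a read against itself or its own reverse complement
            suffix = seqs[a][p - starts[a]:]
            if seq_b.startswith(suffix):
                overlaps.append((a, b, len(suffix)))
    return overlaps
```

For example, `find_overlaps(["ACGTAC", "TACGGA"], 3)` reports that the last three bases of the first read (`TAC`) match the start of the second read, along with the mirror-image overlap between the two reverse complements.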
Keywords/Search Tags: next-generation sequence assembly, de Bruijn graph, suffix array, sequence error correction, spaced seed indexing, OpenMP, data clustering