Sequence Assembly Algorithms For Next-generation Sequencing Technology Research

Posted on:2013-02-08

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Chen

Full Text:PDF

GTID:2210330374462878

Subject:Biological Information Science and Technology

Abstract/Summary:

In recent years, next-generation sequencing technologies have been evolving rapidly. Compared with traditional Sanger sequencing, which is cost-prohibitive, next-generation sequencing (NGS) platforms can generate enormous numbers of sequence reads rapidly and at a markedly reduced price, which makes it possible to sequence more genomes. However, the reads generated by NGS platforms have a much higher error rate and are much shorter, which poses a major challenge for sequence assembly. For this reason, it is important to develop effective data processing methods specific to the large volume of error-prone reads from NGS platforms, so that genome assembly can bring the promise of whole-genome sequencing to fruition.

To reduce the serious impact of sequencing errors on de Bruijn graph-based sequence assembly approaches, this thesis presents a novel error correction algorithm aimed at improving assembly results. The algorithm builds a suffix array on the string of all the reads and their reverse complements and uses it to find the overlaps among reads. It then uses the overlap information to perform multiple sequence alignment on the overlapping reads and correct sequencing errors. The test results indicate that this error correction algorithm helps sequence assemblers improve their assembly results substantially.

In addition to the error correction algorithm, the thesis presents a data clustering algorithm that can be used to cluster large sets of NGS reads and thereby reduce the memory requirements of sequence assemblers. The data clustering algorithm makes it possible for de Bruijn graph-based sequence assemblers to assemble very large genomes. The algorithm is based on spaced seed indexing and uses OpenMP to cluster reads in parallel. The test results indicate that this data clustering algorithm reduces the redundancy in the sequence reads efficiently.
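
The overlap-detection step of the error correction algorithm could be sketched roughly as follows. This is an illustrative Python sketch, not the thesis's implementation: the function names (`build_index`, `find_overlaps`), the `$` separator, the `min_len` parameter, and the naive sort-based suffix array construction are all assumptions made for the example, and the multiple sequence alignment step that performs the actual correction is omitted.

```python
from bisect import bisect_left, bisect_right

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_index(reads):
    """Concatenate all reads and their reverse complements and build a
    suffix array over the result (naive construction, for small inputs)."""
    seqs = []
    for read in reads:
        seqs.append(read)                      # even index: forward strand
        seqs.append(reverse_complement(read))  # odd index: reverse complement
    text = "$".join(seqs) + "$"                # '$' separates sequences
    starts = []
    pos = 0
    for s in seqs:
        starts.append(pos)
        pos += len(s) + 1
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return seqs, text, starts, suffix_array


def find_overlaps(reads, min_len):
    """Return (i, j, length) triples where a proper suffix of seqs[i]
    equals a prefix of seqs[j] and is at least min_len long."""
    seqs, text, starts, sa = build_index(reads)
    # Length-min_len prefixes of the sorted suffixes are themselves sorted,
    # so binary search can locate every occurrence of a seed.
    keys = [text[i:i + min_len] for i in sa]
    overlaps = []
    for j, query in enumerate(seqs):
        if len(query) < min_len:
            continue
        seed = query[:min_len]
        lo = bisect_left(keys, seed)
        hi = bisect_right(keys, seed)
        for p in sa[lo:hi]:
            i = bisect_right(starts, p) - 1    # sequence containing position p
            offset = p - starts[i]
            # Skip the read itself, its own reverse complement, and
            # matches that start at the beginning of a sequence.
            if i // 2 == j // 2 or offset == 0:
                continue
            tail = seqs[i][offset:]
            if query.startswith(tail):
                overlaps.append((i, j, len(tail)))
    return overlaps
```

For example, `find_overlaps(["GATTACAGG", "ACAGGTTC"], 4)` reports the 5-base overlap `ACAGG` between the two reads, together with the matching overlap between their reverse complements. A production implementation would need a linear-time or near-linear-time suffix array construction to handle NGS-scale data.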
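
The idea behind clustering by spaced seed indexing can be illustrated with a small sequential Python sketch. The abstract does not give the details of the thesis's clustering algorithm, so everything here is an assumption for illustration: the seed pattern `1101011`, the rule that two reads join the same cluster when they share at least one spaced-seed key, the union-find bookkeeping, and the names `spaced_keys` and `cluster_reads`. The OpenMP parallelization is not shown.

```python
from collections import defaultdict


def spaced_keys(read, pattern):
    """Yield the spaced-seed key for every window of the read.
    In the pattern, '1' marks a position that must match and '0' marks a
    don't-care position that is skipped when forming the key."""
    care = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    for start in range(len(read) - span + 1):
        yield "".join(read[start + i] for i in care)


def cluster_reads(reads, pattern="1101011"):
    """Group read indices so that reads sharing any spaced-seed key end up
    in the same cluster (hypothetical clustering rule)."""
    parent = list(range(len(reads)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # spaced-seed key -> first read index that produced it
    for idx, read in enumerate(reads):
        for key in spaced_keys(read, pattern):
            if key in first_seen:
                a, b = find(idx), find(first_seen[key])
                if a != b:
                    parent[a] = b
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

Because the key ignores the don't-care positions, two reads that differ by a single sequencing error can still share a key. For instance, `cluster_reads(["ACGTACGTAC", "ACGAACGTAC", "TTTTTTTTTT"])` places the first two reads, which differ at one base, in one cluster and leaves the third on its own. The short weight-5 seed is a toy choice; real data would call for longer seeds to avoid chance matches.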

Keywords/Search Tags:

next-generation sequence assembly, de Bruijn graph, suffix array, sequence error correction, spaced seed indexing, OpenMP, data clustering

Related items

1	Studied On Gene Sequence Alignment Based On Mixed Suffix Tree And Suffix Array
2	Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array
3	Research And Implementation Of Sequence Assembly Parallel Programming On Bi-directed De Bruijn Graph
4	A Parallel And Optimized Algorithms For De Novo Short Read Assembly Using De Bruijn Graphs
5	Research On De Bruijn Graph For DNA Sequence Assembly
6	Scaling short read de novo DNA sequence assembly to gigabase genomes
7	Research On Genomic Reads Mapping Based On De Bruijn Graph Model
8	The Implementation Of Metagenome Sequencing Assembly Based On De Bruijn Graph Algorithm
9	Ultra-large Multiple Sequence Alignment Based On Distributed Computing
10	Alignment-free Sequence Similarity Analysis And Clustering Algorithms On Biological Sequences