Genome Assembly Guided By Reads

Posted on: 2013-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: P L Ceng
Full Text: PDF
GTID: 2250330392967952
Subject: Computer Science and Technology
Abstract/Summary:
Genome assembly is a core problem in bioinformatics: assembling the reads produced by DNA sequencing yields genome sequences. The emergence of next-generation sequencing has greatly aided life science research on major questions, but it also poses an unprecedented challenge for genome assembly because its data are massive in volume, short in read length, and relatively low in accuracy, so traditional algorithms are no longer applicable. Developing sequence assembly software that meets practical needs has therefore become a key research topic.

First, this thesis briefly introduces next-generation sequencing, including its background, sequencing strategies, and technical features. It analyzes the main challenges of genome assembly (numerous repeats, massive data volume, short read length, and relatively low accuracy) and examines the main assembly strategies in depth: greedy, overlap-layout-consensus, and De Bruijn graph. It also summarizes the advantages and disadvantages of the different algorithms and offers specific suggestions for future algorithms.

Second, this thesis proposes a new read-guided genome assembly algorithm that treats entire read sequences as the basic assembly unit. The algorithm introduces a scoring mechanism based on accumulated assembly information and data characteristics. It is divided into two phases, reads assembly and contigs assembly. Reads assembly mainly consists of data preprocessing, De Bruijn graph construction, and contig generation. Contigs assembly includes determining the relative positions of contigs, overlap detection, contig linking, and gap filling. This phase introduces the concept of the paired reads number (PEN) array, and it removes or corrects errors at contig ends using sequence alignment methods to improve assembly quality.

Finally, this thesis presents algorithm verification and performance evaluation. Several data sets were chosen to test the software, and Mauve Assembly Metrics was used to compare its assembly results with those of other mainstream assembly software. Based on the evaluation results, the thesis concludes that the proposed algorithm performs well in both assembly length and assembly accuracy.
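The De Bruijn graph construction and contig generation steps named in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the scoring mechanism, read-guided extension, preprocessing, and error handling are omitted, and the function names `build_de_bruijn` and `unbranched_contigs` and the choice of `k` are assumptions made for illustration only.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unbranched_contigs(graph):
    """Spell out maximal non-branching paths as contigs.

    Simplification: isolated cycles (where every node has in-degree 1 and
    out-degree 1) are not reported.
    """
    indeg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(n):
        # A node that a contig can pass straight through.
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    contigs = []
    for node in nodes:
        if is_simple(node):
            continue  # paths start only at branching or terminal nodes
        for succ in graph.get(node, ()):
            path = node + succ[-1]
            cur = succ
            while is_simple(cur):
                cur = next(iter(graph[cur]))
                path += cur[-1]
            contigs.append(path)
    return sorted(contigs)


reads = ["ATGGCGT", "GGCGTGC"]
contigs = unbranched_contigs(build_de_bruijn(reads, k=4))
```

With these two overlapping toy reads the graph is a single linear path, so the sketch yields one contig that spans both reads. Real short-read data contain sequencing errors and repeats, which create branches and are where a scoring mechanism such as the one the abstract describes would come into play.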
Keywords/Search Tags: bioinformatics, next-generation sequencing, genome assembly, reads, De Bruijn graph