
Research On DNA Assembly Algorithms Based On The Statistical Model

Posted on: 2013-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: D T Han
Full Text: PDF
GTID: 2250330392469498
Subject: Computer technology
Abstract/Summary:
Sequencing remains at the core of genomics. Recently, new sequencing technologies have emerged. Compared with traditional Sanger methods, these technologies produce shorter reads in much larger numbers and with greater coverage. However, very short reads are not well suited to the traditional overlap-layout-consensus approach, and the newer methods still need improvement. We therefore introduce a new DNA assembly algorithm based on a statistical model.

Because the reads contain many base sequencing errors, three error correction methods are introduced; correction software must be applied to the reads before assembly in order to increase the accuracy of the algorithm. The DNA assembly algorithm based on the statistical model overcomes a weakness of the original algorithm, namely its over-reliance on overlap between fragments. DNA assembly is treated as a second-order discrete Markov process, and each base fragment is abstracted as a state of the system. The algorithm builds a probabilistic model to store the state sequences and all the transition probabilities. Then, given two precursor states, the best next state is the one with the maximum transition probability. Finally, the best state is used to update the precursor states, and the current state sequence is extended by repeating the above process. Extension stops when no maximum transition probability exists, which yields a long contig; in this way the algorithm can produce a number of contigs.

In the actual assembly process, however, missing suffixes, repeats, and a high incidence of errors occur, which greatly increases the difficulty of DNA assembly. In this paper, a series of heuristic rules is used to optimize the algorithm and solve these assembly problems. The method is compared with SOAPdenovo and Velvet on E. coli sequence data; the contig number, total length, maximum length, average length, and time consumption obtained show that this algorithm gives better results.
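The extension step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that a state is a fixed-length k-mer, that consecutive states in a read are shifted by one base, and that a tie between candidates counts as "no maximum transition probability". The function names `build_model`, `next_state`, and `extend` are hypothetical, and the heuristic rules for missing suffixes, repeats, and errors are not modeled.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Split a read into overlapping k-mers (assumed definition of a state)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_model(reads, k):
    """Count second-order transitions: (state a, state b) -> next state c."""
    counts = defaultdict(Counter)
    for read in reads:
        states = kmers(read, k)
        for a, b, c in zip(states, states[1:], states[2:]):
            counts[(a, b)][c] += 1
    return counts


def next_state(counts, a, b):
    """Return the most probable successor of (a, b), or None if there is none.

    For a fixed pair (a, b) the highest count is also the highest
    transition probability, so no normalization is needed here.
    A tie is treated as "no maximum" (an assumption of this sketch).
    """
    followers = counts.get((a, b))
    if not followers:
        return None
    ranked = followers.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def extend(counts, a, b, max_steps=10_000):
    """Greedily grow a contig from two precursor states a and b."""
    contig = a + b[-1]
    seen = {(a, b)}  # guard against looping forever on a repeat
    for _ in range(max_steps):
        c = next_state(counts, a, b)
        if c is None or (b, c) in seen:
            break
        contig += c[-1]       # each new state adds one base
        seen.add((b, c))
        a, b = b, c           # the best state becomes a precursor
    return contig
```

For example, with the three overlapping toy reads `"ACGTTGC"`, `"GTTGCATG"`, and `"GCATGGA"` and `k = 3`, calling `extend(build_model(reads, 3), "ACG", "CGT")` walks through the reads and stops once the pair `("TGG", "GGA")` has no recorded successor.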
Keywords/Search Tags: DNA assembly, de novo sequencing, probability model