Parallel Optimization For Whole Genome Re-sequencing Sequences Analysis Pipeline

Posted on: 2015-10-04
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2310330509960574
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of genome sequencing technology, biological sequence databases continue to grow exponentially. However, current whole-genome sequence analysis pipelines are not efficient enough to process genome sequence data in a timely manner. We analyzed and optimized the pipeline and obtained significant improvements.

First, we tested SGA, the most widely used assembler, and found that constructing the BWT index for a large collection of reads accounted for more than 75 percent of the total runtime. We introduced a new algorithm that constructs the BWT for a large collection of reads on multiple processors, and we proposed a pruning strategy that avoids unnecessary sorting, yielding a 26x speedup over BCR. Our software, named BWTCP, was tested on Tianhe-2 and built the BWT of one billion reads of length 100 within half an hour. Furthermore, current BWT builders are sensitive to read length, whereas BWTCP addresses this concern through the novel pruning strategy.

For the sequence alignment step, we developed a fast and accurate short-read aligner named MICA, together with the HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory in the Department of Computer Science, University of Hong Kong. MICA was designed around the features of Intel MIC and Tianhe-2 to take full advantage of the hardware, and it scales linearly. We ran MICA on 932 computing nodes of Tianhe-2 to align 17.4 TB of DNA sequences in less than an hour, whereas the same task would take three months on general servers equipped with 12-core CPUs.

A robust RNA editing detection model that provides high-confidence RNA editing sites has long been needed. We introduced such a model based on high-throughput sequence alignment technology. In this model, we analyzed four events that result in RNA-DNA differences and presented a formula for the probability of RNA editing, derived using Bayesian probability theory. Experiments show that our model gains a confidence improvement of 18% compared with current methods.
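
For readers unfamiliar with the BWT index named in the abstract, the following minimal sketch shows what the BWT of a collection of reads is. It is a naive sort-all-suffixes illustration only, not the BWTCP or BCR algorithm, and it includes neither the parallel construction nor the pruning strategy. The function name `bwt_of_reads` and the convention that each read ends with a `$` marker, with ties between markers broken by read order, are assumptions made for this example.

```python
def bwt_of_reads(reads):
    """Naive BWT of a read collection (illustration only, not BWTCP or BCR).

    Assumption: every read gets its own end marker "$", and the marker of an
    earlier read sorts before the marker of a later read.
    """
    entries = []
    for idx, read in enumerate(reads):
        text = read + "$"
        for j in range(len(text)):
            # The BWT symbol is the character preceding the suffix in its
            # own read; for j == 0 this wraps around to the end marker.
            prev = text[j - 1]
            entries.append((text[j:], idx, prev))
    # Sort suffixes lexicographically; identical suffixes from different
    # reads are ordered by read index, which emulates distinct end markers.
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(prev for _, _, prev in entries)


if __name__ == "__main__":
    print(bwt_of_reads(["ACG", "TCG"]))
```

This version materializes every suffix and sorts them all, so its cost grows quickly with the number and length of reads. That cost is why index construction can dominate the runtime of an assembly pipeline on large read sets.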
Keywords/Search Tags: Whole-genome Sequence Analysis, High-Throughput Sequencing, Genome Assembly, BWT, Tianhe-2, Sequence Alignment, RNA Editing