Parallel Optimization For Whole Genome Re-sequencing Sequences Analysis Pipeline

Posted on:2015-10-04

Degree:Master

Type:Thesis

Country:China

Candidate:H Wang

Full Text:PDF

GTID:2310330509960574

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of genome sequencing technology, biological sequence databases continue to grow exponentially. However, current whole-genome sequence analysis pipelines are not efficient enough to process genome sequence data in a timely manner. We analyzed and optimized the pipeline and obtained significant improvements.

First, we tested SGA, the most widely used assembler, and found that constructing the BWT index for a large collection of reads accounted for more than 75 percent of the total runtime. We introduced a new algorithm that constructs the BWT for a large collection of reads on multiple processors, and we proposed a pruning strategy that avoids unnecessary sorting, yielding a 26x speedup over BCR. Our efficient software, named BWTCP, was tested on Tianhe-2 and built the BWT of one billion reads of length 100 within half an hour. Furthermore, current BWT builders are sensitive to read length; BWTCP addresses this concern through the novel pruning strategy.

For the sequence alignment step, we developed a fast and accurate short-read aligner named MICA in collaboration with the HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory in the Department of Computer Science, University of Hong Kong. MICA was designed around the features of Intel MIC and Tianhe-2 to take full advantage of the hardware, and it scales linearly. We ran MICA on 932 computing nodes of Tianhe-2 to align 17.4 TB of DNA sequences in less than an hour, whereas the same task would take three months on general-purpose servers equipped with 12-core CPUs.

A robust RNA editing detection model has long been needed to provide high-confidence RNA editing sites. We introduced such a model based on high-throughput sequence alignment technology. In this model, we analyzed four events that result in RNA-DNA differences and presented a formula, derived using Bayesian probability theory, for the probability of RNA editing. Experiments show that our model achieves an 18% improvement in confidence compared with current methods.
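As a reminder of what the BWT index of a read collection contains, the following is a minimal, naive sketch in Python. It is not the BWTCP or BCR algorithm and uses no pruning or parallelism; it simply sorts every suffix of every read, so it is only practical for toy inputs. The function name `bwt_of_reads` and the convention that each read carries its own end marker (printed as `$`, ordered by read index, and smaller than any base) are assumptions made for illustration.

```python
def bwt_of_reads(reads):
    """Naive BWT of a read collection (illustrative only, not BWTCP/BCR).

    Assumption: read i is terminated by its own end marker $_i, with
    $_0 < $_1 < ... < every base. All markers are printed as '$'.
    """
    entries = []
    for i, read in enumerate(reads):
        for j in range(len(read) + 1):
            # Sort key for the suffix read[j:] followed by the marker of read i.
            # Bases are encoded as (1, code) and markers as (0, i), so markers
            # sort before bases and are ordered among themselves by read index.
            key = [(1, ord(c)) for c in read[j:]] + [(0, i)]
            # The BWT symbol is the character preceding the suffix; the full
            # read (j == 0) is preceded, cyclically, by its own end marker.
            prev = read[j - 1] if j > 0 else "$"
            entries.append((key, prev))
    # Sort all suffixes of all reads together, then read off the preceding symbols.
    entries.sort(key=lambda e: e[0])
    return "".join(prev for _, prev in entries)


if __name__ == "__main__":
    print(bwt_of_reads(["AC", "AG"]))
```

For the single read `ACG` this returns `G$AC`, and for the two reads `AC` and `AG` it returns `CG$$AA`. Sorting all suffixes explicitly is the step whose cost grows with read length and collection size, which is the kind of work a pruning strategy such as the one described above aims to reduce.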

Keywords/Search Tags:

Whole-genome Sequence Analysis, High-Throughput Sequencing, Genome Assembly, BWT, Tianhe-2, Sequence Alignment, RNA Editing

Related items

1	Research On Genomic Sequence Alignment Methods Based On High-throughput Sequencing Data
2	A Design Of Short Gene Sequence Alignment Acceleration System Based On High Performance Hash Table
3	Research On Sequence Alignment Methods For The Third-generation Sequencing Data
4	Research On Genome Misassembly Identification Method Based On High-throughput Sequencing Data
5	Assembling Of Klebsiella Pneumoniae Genome Based On High-throughput Sequencing Technology
6	Whole Microbial Genome Assembly And Analysis Based On Ion Torrent Sequencing Data
7	Research On Detecting Methods Of Indels In Next-generation Sequencing Data Of Human Genome And Establishment Of Detecting Platform For Indels In Genome
8	Research And Application Of The Third Generation Genome Assembly High Repeat Sequence
9	Research On Genomic Reads Mapping Based On De Bruijn Graph Model
10	Chloroplast Genome Assembly And Analysis Based On High-throughput Sequencing Of Mixed DNA Samples