
Approaches to scaling and improving metagenome sequence assembly

Posted on: 2014-09-05
Degree: Ph.D
Type: Dissertation
University: Michigan State University
Candidate: Pell, Jason
Full Text: PDF
GTID: 1452390008959453
Subject: Biology
Abstract/Summary:
Since the completion of the Human Genome Project in the early 2000s, new high-throughput sequencing technologies have been developed that produce far more DNA sequence reads at a much lower cost. As a result, large quantities of data have been generated that are difficult to analyze computationally, not only because of the sheer number of reads but also because of sequencing errors. Metagenomics, in which an ensemble of microbes in an environmental sample is sequenced, is a particularly difficult case: mixtures of species at widely varying abundance levels must be processed together in a bioinformatics pipeline. A common goal for a sequencing dataset is to assemble a genome from the set of reads, but because all-against-all read comparison scales quadratically, new algorithms had to be developed to handle the large quantities of short reads generated by the latest sequencers. These assembly algorithms frequently use de Bruijn graphs, in which reads are broken down into k-mers, short DNA words of a fixed size k. Despite these algorithmic advances, DNA sequence assembly still scales poorly because of sequencing errors and memory inefficiency.

In this dissertation, we develop approaches to tackle the current shortcomings in metagenome sequence assembly. First, we devise a novel use of a Bloom filter, a probabilistic set-membership data structure that permits false positives, for storing a de Bruijn graph in memory. We study the properties of a de Bruijn graph with false positives in detail and observe that the components of the graph abruptly connect together at a specific false positive rate. We then analyze the memory efficiency of a partitioning algorithm at various false positive rates and find that this approach can lead to a 40x decrease in memory usage.

Extending the idea of a probabilistic de Bruijn graph, we then develop a two-pass error correction algorithm that discards erroneous reads and corrects the remaining majority to be more accurate.
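The core idea of a Bloom-filter-backed de Bruijn graph is that the graph is never stored explicitly: k-mers are inserted into the filter, and edges are recovered implicitly by shifting a k-mer one base and testing membership. The following is a minimal illustrative sketch, not the dissertation's actual implementation; the class and function names, filter size, and hash scheme are all our own assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus several derived hash functions.
    Membership tests can yield false positives but never false negatives."""
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def kmers(seq, k):
    """All overlapping k-length substrings of a read."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def load_graph(reads, k, bloom):
    """Store the de Bruijn graph implicitly: just insert every k-mer."""
    for read in reads:
        for km in kmers(read, k):
            bloom.add(km)

def neighbors(km, bloom):
    """Implicit graph edges: extend the (k-1)-suffix by each base and
    keep the extensions the filter claims to contain."""
    suffix = km[1:]
    return [suffix + b for b in "ACGT" if (suffix + b) in bloom]
```

Because membership has false positives, some reported neighbors may be spurious; the dissertation's observation is that once the false positive rate passes a critical threshold, these spurious edges abruptly connect the graph's components.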
In the first pass, we use the digital normalization algorithm to retain novel reads and discard those whose coverage is already sufficient. In the second pass, reads are corrected by aligning them to the de Bruijn graph, with several heuristics to improve performance. We evaluate the algorithm on an E. coli dataset as well as a mock human gut metagenome dataset and find that the error correction strategy works as intended.
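The first-pass idea can be sketched in a few lines: stream the reads, estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that median is below a cutoff. This sketch uses an exact dictionary of counts for clarity; the function name, cutoff, and k value are illustrative assumptions, and a real implementation would use a compact probabilistic counting structure instead.

```python
from collections import defaultdict
from statistics import median

def kmers(seq, k):
    """All overlapping k-length substrings of a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalization(reads, k, cutoff):
    """Single streaming pass: keep a read only if the median count of its
    k-mers (among reads kept so far) is still below the coverage cutoff."""
    counts = defaultdict(int)  # exact counts here; real tools use a sketch
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            for km in kms:
                counts[km] += 1
    return kept
```

Because redundant reads are discarded before assembly, downstream memory use drops while novel sequence is retained; erroneous reads tend to carry rare k-mers and are handled by the second, alignment-based pass.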
Keywords/Search Tags: Metagenome, Sequence, De Bruijn graph, Reads, Assembly