
Tiger: Tiled iterative genome assembler and approximate multi-genome aligner

Posted on: 2014-09-02
Degree: Ph.D.
Type: Dissertation
University: University of Illinois at Urbana-Champaign
Candidate: Wu, Xiao-Long
Full Text: PDF
GTID: 1450390005992445
Subject: Bioinformatics
Abstract/Summary:
Sequence assembly and alignment are two important stepping stones for comparative genomics. With the rapid adoption of next-generation sequencing (NGS) technologies and the arrival of third-generation sequencing (TGS) technologies, genomics offers an unprecedented opportunity to answer fundamental questions in biology and to elucidate human diseases. However, most de novo assemblers require an enormous amount of computational resources, which are not readily available to most research groups and medical personnel. Moreover, there has been little progress in assembly quality, especially for highly repetitive genomes. As more affordable raw data and assembled genomes become accessible to the community, there is an emerging demand for genome searches among the large number of divergent genomes in gene banks. These genomes may take the form of raw reads, unfinished or low-quality assemblies, or completed genomes, for which traditional multi-sequence alignment tools may not be suitable for similarity searches. Yet few research studies aim to meet this demand. We have developed a novel de novo assembly framework, called Tiger assembler, which adapts to the available computing resources by iteratively decomposing the assembly problem into sub-problems. Our method can flexibly embed different assemblers for various types of target genomes. Using sequence data from a human chromosome, our results show that, compared to Velvet and SOAPdenovo, Tiger achieves much better NG50s and better genome coverage, with slightly higher errors, while using a modest amount of memory of the kind available in commodity computers today. We also experimented with a real de novo assembly, i.e., the E. mexicana genome, and demonstrated the strength of our work. The N50s of the contigs and scaffolds produced by Tiger were 7 and 57 times longer, respectively, than those produced by SOAPdenovo. By contrast, the assembly produced by ALLPATHS-LG amounted to only one-third of the genome size.
We also developed a multi-genome sequence aligner, called Tiger aligner, which performs fast similarity checks among multiple genomes that are distantly related or available only as low-quality raw data. Practical applications of the tool are demonstrated through experiments. The performance of Tiger aligner on traditional multi-sequence alignments is also compared against two existing tools, MUMmer and SOAPaligner. The results show the practicality and strengths of our tool. Most state-of-the-art assemblers that achieve relatively high assembly quality need an excessive amount of computing resources (in particular, memory) that are not readily available to most researchers. Tiger assembler provides the only known viable path to using NGS de novo assemblers that require more memory than is present in the available computers. Evaluation results demonstrate that it is feasible to obtain better-quality results with a low memory footprint, and that the approach scales across distributed commodity computers. The explosion in the number of genomes makes existing multi-sequence aligners impractical for checking similarities among genomes that differ in evolutionary relationship and sequence completeness. Current pairwise sequence aligners cannot cope with such genomes without major revisions because of their inherent algorithmic limitations. Tiger aligner is the first known work designed to address these multi-genome problems, leveraging the idea of feature-based image recognition.
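For readers unfamiliar with the N50 and NG50 statistics cited in the abstract, the following is a minimal sketch of how these metrics are commonly computed. It is not taken from the dissertation: the function names and the contig lengths are made up for illustration, and the definition used (the length of the contig at which the cumulative length, taken longest first, reaches half of the assembly size for N50 or half of the genome size for NG50) is the common one, assumed here rather than confirmed by the source.

```python
def n50(lengths, total=None):
    """Length of the contig at which the running sum (longest contigs
    first) reaches at least half of `total`.

    `total` defaults to the assembly size, i.e. the sum of all contig
    lengths, which gives the usual N50.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    # The contigs together cover less than half of `total`.
    return 0


def ng50(lengths, genome_size):
    """Same statistic, but measured against a reference or estimated
    genome size instead of the assembly size."""
    return n50(lengths, total=genome_size)


# Hypothetical contig lengths, for illustration only.
contigs = [80, 70, 50, 40, 30, 20, 10]

print(n50(contigs))        # measured against the assembly size (300)
print(ng50(contigs, 400))  # measured against an assumed genome size of 400
```

Because NG50 is measured against the genome size rather than the assembly size, it allows assemblies of different total lengths to be compared on the same footing, which is presumably why NG50 is reported for the human chromosome comparison.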
Keywords/Search Tags:Tiger, Genome, Aligner, Assembly, Sequence, Assembler, De novo