Parallel and Cloud Computing Based Genome Assembly using Bi-directed String Graphs

Posted on: 2013-07-26
Degree: M.S.
Type: Thesis
University: The George Washington University
Candidate: Kumari, Priti
GTID: 2450390008965150
Subject: Biology
Abstract/Summary:
Background: Whole genome sequencing has proven to be one of the most powerful technologies in the field of genetics. It has found numerous applications, from plant and microbial genomics to advanced human genomics, and it has been shown to provide the most comprehensive collection of an individual's genetic variants. Sanger sequencing dominated the industry for nearly two decades; with the advent of next-generation sequencing (NGS), whole genome sequencing has become more efficient. However, NGS has a major limitation: it produces shorter reads. This limitation makes genome assembly from NGS data more complicated and dependent on large computational resources. This thesis compares two assemblers designed for short NGS reads, both based on the newer De Bruijn graph approach: Velvet and Contrail. Velvet relies on large memory (RAM) to solve the assembly graph, whereas Contrail relies on the Hadoop programming framework to distribute the assembly process in parallel over several nodes. The research compares the assembly statistics obtained after running an assembly pipeline on a given dataset. It also compares paired-read and unpaired-read sequencing for the Velvet assembler.

Results: The first phase of the analysis involved running assemblies with k-mer lengths from 15 to 65 on a small dataset (2X coverage) using Velvet and Contrail. The best assembly statistics were obtained with a k-mer size of 65. This k-mer value was then kept fixed for the remaining experiments. The comparison between paired and unpaired read assembly on a small dataset using Velvet did not show a significant difference.
However, when applied to a comparatively larger dataset, paired reads seemed to assemble better than unpaired reads. The comparison between Contrail and Velvet on a small dataset showed that Velvet takes less time to complete and provides better assembly quality. When the entire dataset, with 192X read coverage and about 70 gigabytes of data, was assembled, Velvet failed to complete the assembly process. Contrail, on the other hand, took about 240 hours but ran to completion. After Velvet failed on the entire dataset, the data was divided in half and assembled again using Velvet. This time Velvet completed the process. However, Contrail showed much better assembly statistics.

Conclusion: This research supports the view that the De Bruijn graph approach is a more advanced, less complicated, and reliable way to assemble short-read NGS sequences. It can be concluded that the k-mer size used for assembling short reads should be about 65% of the read size; at this length the assembly quality is optimal. When deciding which assembler to use, the size of the dataset should be taken into consideration. For a relatively small dataset, such as that of a microbial or small eukaryotic genome, Velvet would be the better option: although Velvet loads the entire De Bruijn graph into memory, assembling such genomes does not require a large-memory server. However, for a mammalian genome, Velvet would tend to fail unless a very large-memory server (more than 1 TB) is used. Because such servers are expensive and difficult to install, Contrail would be a better solution. Contrail runs on Hadoop, which distributes the assembly over several nodes. Installing and setting up Hadoop can also be expensive and difficult, but a Hadoop cluster can be rented from cloud computing providers.
Hence, Contrail would provide a simple and cost-effective way to perform de novo assembly of short reads obtained from large genomes.
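
To make the De Bruijn graph idea and the k-mer size guideline from the abstract concrete, here is a minimal Python sketch. It is not the Velvet or Contrail implementation: the function names `choose_k` and `build_de_bruijn`, the toy reads, and the choice to force k to an odd number (a common convention among De Bruijn assemblers) are assumptions made for illustration. The sketch uses one common textbook formulation in which nodes are (k-1)-mers and each k-mer contributes an edge; it ignores reverse complements, sequencing errors, and paired-read information.

```python
from collections import defaultdict


def choose_k(read_length, fraction=0.65):
    """Pick a k-mer size near `fraction` of the read length (hypothetical helper).

    The result is forced to be odd, which is an assumption of this sketch.
    """
    k = int(round(read_length * fraction))
    if k % 2 == 0:
        k -= 1
    return max(k, 1)


def build_de_bruijn(reads, k):
    """Build a toy De Bruijn graph from reads.

    Each k-mer in a read adds an edge from its (k-1)-base prefix
    to its (k-1)-base suffix. Duplicate edges are collapsed.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    # Two overlapping toy reads; real NGS reads are far longer.
    reads = ["ACGTAC", "CGTACG"]
    for node, successors in sorted(build_de_bruijn(reads, 3).items()):
        print(node, "->", sorted(successors))
```

For hypothetical 100-base reads, `choose_k(100)` returns 65. The whole graph in this sketch lives in one in-memory dictionary, which is the kind of memory pressure the abstract attributes to Velvet on large genomes; a distributed assembler would instead partition the k-mers across nodes.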
Keywords/Search Tags:Assembly, Genome, Using, Velvet, Sequencing, Reads, Graph, NGS