De Novo Transcriptome Assembly From RNA-seq

Posted on: 2015-01-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Chang
Full Text: PDF
GTID: 1260330431455175
Subject: Operational Research and Cybernetics
Abstract/Summary:
Bioinformatics is a burgeoning interdisciplinary field that employs techniques from mathematics, information theory, statistics and computer science to solve problems in biology, particularly in molecular biology. One of the most important and challenging problems in this field is de novo transcriptome assembly, the task of assembling all expressed transcripts in a transcriptome from sequencing data. In this thesis, we discuss how to reconstruct de novo all transcripts in a complex eukaryotic transcriptome using a classic combinatorial optimization algorithm. This will benefit the study of diseases related to alternative splicing, especially cancer research.

With the development of next generation sequencing, RNA-seq has become a powerful tool for transcriptome analysis. However, it also poses unprecedented challenges for transcriptome assembly. The existing transcriptome assembly algorithms fall into two general categories: reference-based and de novo approaches. Although reference-based methods perform better than de novo methods, they can be used only when a high-quality reference genome is available. In fact, most organisms do not have such a genome. De novo assembly, which is more computationally challenging than reference-based assembly, provides a solution in this situation. Several de novo assemblers have been developed to date, but none of them performs very well.

In this thesis, we analyze the existing assembly methods and present a new de novo assembler, Bridger, which takes advantage of techniques used in the reference-based assembler Cufflinks to overcome limitations of the existing de novo assemblers. When tested on dog, human and mouse RNA-seq data, Bridger assembled more full-length transcripts while reporting considerably fewer candidate transcripts, and hence greatly reduced false positive transcripts in comparison with the state-of-the-art de novo assemblers. In addition, it runs substantially faster and requires less memory than the other assemblers. More interestingly, Bridger reaches a level of sensitivity and accuracy comparable to that of popular reference-based assemblers such as Cufflinks.

Bridger is an innovative algorithm for the following reasons:
(i) Instead of using the popular de Bruijn graph, it constructs a splicing graph for each gene encoded in the genome based on the given RNA-seq data. A splicing graph provides a natural and lossless representation of all the splicing isoforms at each gene locus.
(ii) Paired-end reads are used when constructing splicing graphs, both to obtain a more complete graph and to control the size of the graph, which facilitates the search for transcripts in the graph.
(iii) By introducing an auxiliary graph, the junction graph, a classic combinatorial optimization model, minimum path cover, is successfully applied to de novo assembly to reduce the false positive rate of assembled transcripts.
(iv) By adding weights to the model, sequencing depth information is used to improve the accuracy of de novo assembly. To the best of our knowledge, this is the first time that sequencing depth information has been successfully incorporated into a de novo assembler.

Although Bridger has many advantages, it also has two shortcomings. First, the implementation of Bridger needs to be further optimized, and parallelization is necessary in the step of constructing splicing graphs. Second, the minimum path cover model does not always work well; in the cases where it performs poorly, additional techniques can be employed to overcome this limitation.

Two examples are given to demonstrate the value of Bridger in application. In one example, Bridger was applied to lung adenocarcinoma data and identified two alternatively spliced transcripts of one oncogene, together with their differential expression across samples. In the other example, Bridger was applied to dog RNA-seq data and discovered many novel genes and transcripts that have not been annotated in the current dog genome. Finally, we describe the downstream analyses of transcriptome assembly and propose several related topics as future research directions.

Bridger is implemented in C++ and the source code is publicly available from: https://sourceforge.net/projects/rnaseqassembly/files/?source=navbar.
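
To make the minimum path cover model named in the abstract more concrete, here is a minimal sketch of the textbook version of the problem: covering every node of a directed acyclic graph with the fewest vertex-disjoint paths, solved by reduction to bipartite matching. It is an illustration under stated assumptions, not Bridger's implementation. The function name `min_path_cover`, the integer node labels and the toy graph are all hypothetical, and the abstract does not specify how the junction graph or the depth-based weights enter the actual model.

```python
from typing import List, Tuple


def min_path_cover(num_nodes: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Return a minimum set of vertex-disjoint paths covering a DAG.

    Classic reduction: each node is split into a "left" (outgoing) and a
    "right" (incoming) copy, each DAG edge u -> v becomes a bipartite edge,
    and the number of paths equals num_nodes minus the maximum matching size.
    The input is assumed to be acyclic; recursion depth limits this sketch
    to small graphs.
    """
    adj: List[List[int]] = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)

    match_left = [-1] * num_nodes   # match_left[u] = v: edge u -> v is chosen
    match_right = [-1] * num_nodes  # match_right[v] = u: edge u -> v is chosen

    def try_augment(u: int, seen: set) -> bool:
        # Standard augmenting-path search (Kuhn's algorithm).
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                match_left[u] = v
                return True
        return False

    for u in range(num_nodes):
        try_augment(u, set())

    # A path starts at every node with no chosen incoming edge.
    paths: List[List[int]] = []
    for start in range(num_nodes):
        if match_right[start] != -1:
            continue
        path = [start]
        while match_left[path[-1]] != -1:
            path.append(match_left[path[-1]])
        paths.append(path)
    return paths


# Hypothetical toy graph: node 0 branches to nodes 1 and 2, which rejoin at 3.
toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
toy_paths = min_path_cover(5, toy_edges)
print(len(toy_paths), toy_paths)
```

On this toy graph the cover uses two paths. Because the textbook formulation forbids paths from sharing nodes, it cannot by itself recover two isoforms that share exons, so any application to transcript assembly has to adapt the model in some way.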
Keywords/Search Tags: Bioinformatics, Alternative splicing, Next generation sequencing, Transcriptome assembly, Minimum path cover