
Algorithm Studies Of Transcriptome Assembly Based On Next-Generation RNA-seq Data

Posted on: 2021-02-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Yu
GTID: 1360330602980910
Subject: Operational Research and Cybernetics
Abstract/Summary:
With the rapid development of biotechnology, biological data are growing explosively. At the same time, computer and internet technology is advancing steadily, which makes the storage, processing, and transfer of big data possible. As a result, biological data mining has become one of the most important parts of life science research. Bioinformatics, a new interdisciplinary subject based on computer science, mathematics, and biology, has emerged, and within it the study of transcriptomics is an important and fundamental problem.

In recent years, next-generation RNA-seq technology has become an essential and powerful tool for transcriptome analysis thanks to its high throughput and low cost. With the continuous generation of huge volumes of biological data, next-generation sequencing is used more and more widely in the study of gene expression. The reads produced by next-generation sequencing are typically quite short, so they must be assembled into full-length transcripts, which is the main task of this dissertation. In eukaryotic organisms, alternative splicing is a common post-transcriptional process and a crucial mechanism of gene regulation, by which a single gene can generate multiple distinct functional transcripts. Alternative splicing also comes in several types, which makes transcriptome assembly more challenging. The main goal of this dissertation is to build mathematical models of the transcriptome assembly problem and to resolve its bottlenecks using combinatorial optimization theory.

There are two main approaches to transcriptome assembly: genome-guided and de novo. Genome-guided approaches take advantage of an existing genome, to which the RNA-seq reads are first aligned with a mapping tool. Splicing graphs are then built from the mapping results, and individual transcripts are recovered from them. However, the genomes of most species are unknown, in which case the only choice is a de novo approach, which reconstructs full transcripts
directly from RNA-seq reads, without aligning the reads to a genome. Both approaches are important, and neither can replace the other.

We tested several state-of-the-art assemblers of both kinds on simulated and real datasets and found their performance unsatisfactory in both effectiveness and efficiency, which limits their application in practice. Higher-quality algorithms are therefore urgently needed.

Based on these considerations, we propose iPAC, a novel genome-guided transcriptome assembler that effectively addresses the bottleneck of the assembly problem and makes up for the shortcomings of current algorithms. We tested iPAC on both simulated and real datasets and compared it comprehensively with other mainstream assembly algorithms. On all test data, iPAC achieved the highest sensitivity and the highest precision and greatly reduced the number of false-positive transcripts. iPAC is also notably effective at recovering lowly expressed transcripts. We conclude that it is, to some extent, superior to the leading assemblers of the same kind.

The advantages of iPAC may be attributed to the following: (1) an overlap graph of paired paths, followed by a newly designed technique for iteratively extending paired paths, which leads to effective use of paired-end information; (2) a novel phasing graph model, in which iPAC uses the sequencing depth information to determine a reasonable connection between the incoming and outgoing edges of each node in the splicing graph by solving a series of quadratic programming problems; the phasing graph is then updated with the extended paired paths generated in the overlap graph, which integrates the paired-end information with the sequencing depth information and resolves the key difficulty, namely the ambiguity in linking in- and out-splicing junctions at each exon with multiple splicing junctions; and (3) the newly developed technique
for extracting all the transcript-representing paths over the phasing graphs, guided by the edge weights on those graphs.

Although iPAC performs very well, it has some shortcomings. First, the iPAC code is not parallelized, so its computing efficiency can still be improved; parallelization will be a direction of our future work. Second, after iPAC completes an assembly, other tools are used to estimate the expression levels of the assembled transcripts. In the future, we will design our own expression-level estimation module.

In this dissertation, we also introduce a novel de novo assembler, TransLiG. In our tests, TransLiG showed significant advantages in both sensitivity and precision over current mainstream de novo assembly algorithms, and it was better than the other algorithms at reconstructing lowly expressed transcripts. TransLiG has the following innovations: (1) A relatively long k-mer is used to construct the initial splicing graph, and a shorter k-mer is then introduced to modify it. The longer k-mer effectively reduces wrong connections in the splicing graph, and modifying the graph with the shorter k-mer may reduce fragmented sequences, which makes the final splicing graph more reliable. (2) A new quadratic programming model is introduced to integrate sequencing depth and paired-end sequencing information. (3) TransLiG iteratively constructs line graphs starting from the splicing graphs and then recovers all the transcripts by expanding all the isolated nodes generated during the line graph iteration, by which TransLiG obtains the globally optimal solution.

iPAC and TransLiG are implemented in C++ and freely available from:
iPAC: http://sourceforge.net/projects/transassembly/files
TransLiG: https://sourceforge.net/projects/transcriptomeassembly/files/...
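The line-graph iteration mentioned for TransLiG can be illustrated with a small Python sketch. It is not the TransLiG implementation (which is in C++ and is not reproduced here). It assumes an acyclic toy splicing graph, ignores edge weights, sequencing depth, and paired-end information, and does none of the quadratic-programming-based linking described in the abstract. The function names `line_graph_step` and `enumerate_maximal_paths` and the toy exon labels are hypothetical. Under these assumptions, repeatedly taking the line graph turns each junction into a node holding a longer exon path, and a node that ends up isolated is a path that cannot be extended in either direction.

```python
def line_graph_step(edges):
    """One line-graph iteration over a directed graph whose nodes are exon paths.

    Each edge (p, q) becomes a new node holding the merged path p + q[-1:].
    Two new nodes are connected when their source edges are consecutive,
    i.e. (p, q) is followed by (q, r).
    """
    successors = {}
    for p, q in edges:
        successors.setdefault(p, []).append(q)
    new_nodes = [p + q[-1:] for p, q in edges]
    new_edges = [
        (p + q[-1:], q + r[-1:])
        for p, q in edges
        for r in successors.get(q, [])
    ]
    return new_nodes, new_edges


def enumerate_maximal_paths(exons, junctions):
    """Collect isolated nodes across line-graph iterations (acyclic input only).

    Illustrative assumption: with no weight-based pruning, every isolated node
    is simply a maximal path of the splicing graph. A cyclic input would make
    this loop run forever.
    """
    nodes = [(e,) for e in exons]
    edges = [((u,), (v,)) for u, v in junctions]
    paths = []
    while nodes:
        connected = {n for edge in edges for n in edge}
        # A node touched by no edge is isolated: expand it into a candidate path.
        paths.extend(n for n in nodes if n not in connected)
        nodes, edges = line_graph_step(edges)
    return paths


# Toy splicing graph: exons B and C are mutually exclusive between A and D.
toy_exons = ["A", "B", "C", "D"]
toy_junctions = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(enumerate_maximal_paths(toy_exons, toy_junctions))
# prints [('A', 'B', 'D'), ('A', 'C', 'D')]
```

Without pruning, this enumeration returns every maximal path and can grow exponentially on complex genes. In an assembler, the connections kept at each iteration would be chosen using the weight and pairing information, so that only supported paths become isolated nodes.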
Keywords/Search Tags:Second-generation sequencing data, Alternative splicing, Transcriptome reconstruction, Assembly algorithm