Processing And Application Of RNA-seq Data

Posted on: 2013-01-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L K Wang
Full Text: PDF
GTID: 1110330371482687
Subject: Computer application technology
Abstract/Summary:
In recent years, second-generation sequencing technologies have developed rapidly and have become essential in genome and transcriptome studies. Second-generation sequencing platforms can generate large numbers of short reads in a relatively short time. High-throughput RNA sequencing (RNA-seq) is an important application of second-generation sequencing technology. It sequences complementary DNAs synthesized from RNAs on second-generation sequencing platforms in order to study the transcriptome. RNA-seq can be used to estimate the expression of genes or isoforms, detect differentially expressed genes, and identify novel splice junctions.

This study focuses on the processing and application of RNA-seq data. The main contents are as follows:

1) Mapping of spliced reads to the reference genome. Early second-generation sequencing platforms, such as the Illumina GA platform, could only generate relatively short reads. Furthermore, most software tools only consider contiguous alignments when they align reads to reference genomes. However, with the development of sequencing technology, the reads generated by RNA-seq can now exceed 100 bp. A large number of reads may span one or more splice junctions and cannot be mapped to the genome contiguously. Mapping tools therefore need to align long reads in a way that allows them to span introns. Earlier studies built splice junction databases from known gene annotations. With these databases, reads spanning known junctions can be aligned to the reference genome using traditional tools. However, this method relies heavily on gene annotations, and reads spanning unreported junctions cannot be aligned correctly. Therefore, several tools were developed to align spliced reads without depending on gene annotations. This study discusses the tools used for alignment and their respective advantages and disadvantages.
Moreover, a new algorithm was proposed, and a new package called SeqSaw was developed to map spliced reads to reference genomes. In addition to using spaced seeds to accelerate mapping, SeqSaw utilizes static and dynamic hash tables to align spliced reads separated by introns. SeqSaw can dramatically reduce the search space and improve the sensitivity of the mapping results. It does not rely on gene annotations and can align reads spanning unreported splice junctions. The alignments that span unreported splice junctions can therefore be used to predict novel splice junctions.

2) Detection of differentially expressed genes. Mapping results can be processed to estimate the expression of genes and isoforms and to detect differentially expressed genes or isoforms. This study discusses the methods used to estimate the expression of genes and isoforms. An R package was introduced to detect differentially expressed genes. The package integrates two newly developed methods. The first is based on a random sampling model. The second estimates background noise using technical replicates. Both methods use the MA-plot to detect and view differentially expressed genes between two samples. Three other methods were also integrated into the package. The package was named DEGseq and was uploaded to Bioconductor. The methods were applied to two groups of liver and kidney samples to detect differentially expressed genes. The similarities and differences among the detected genes were then analyzed.

3) Genome-wide detection of novel splice junctions. The identification of intron/exon boundaries is a challenging task, especially for non-canonical junction boundaries. Before high-throughput sequencing technologies were developed, sequencing expressed sequence tags was the most popular method to detect splicing events in mRNA. However, sequencing expressed sequence tags is expensive and labor intensive.
Depending on predefined combinations of known exons, high-throughput exon/junction arrays can also be used to detect novel splicing events. However, the arrays cannot detect splicing events involving unannotated exons. As high-throughput sequencing technologies develop, larger numbers of reads can be generated in shorter times and at lower costs. Sequencing results can be used to detect the boundaries of exons and introns and to determine splice junctions at single-base resolution. Based on the alignment of spliced reads, this work predicted raw splice junctions by aggregating aligned spliced reads. For each potential splice site, several realizable factors were calculated and used to balance sensitivity and specificity. When the method was applied to real RNA-seq data sets, it reached higher specificity and sensitivity than other algorithms. SeqSaw was applied to two RNA-seq data sets, and the influence of sequencing depth on the detection of novel junctions was examined. The distribution of the detected novel junctions across the genome and the differential usage of splice junctions between samples were also analyzed.

In summary, a new method for mapping spliced reads to reference genomes was proposed. The method can align spliced reads to the reference genome with high sensitivity and does not depend on gene annotations. Based on the mapping information, the methods used to estimate the expression of genes or isoforms were discussed. Moreover, two methods were proposed to identify differentially expressed genes from read counts. The methods were integrated into an R package named DEGseq. Finally, a new method was proposed to predict splice junctions and was applied to two data sets. The proposed method can achieve higher specificity and sensitivity than other methods.
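
The step described in part 3, predicting raw splice junctions by aggregating aligned spliced reads, can be illustrated with a minimal sketch. It assumes alignments are given as SAM-style records (chromosome, 1-based leftmost position, CIGAR string), where an `N` operation marks a skipped intron. The function names and the `min_reads` support threshold are hypothetical and only stand in for the per-site factors the abstract mentions; this is not SeqSaw's actual implementation.

```python
import re
from collections import Counter

# CIGAR operations that consume reference bases (per the SAM format).
_REF_CONSUMING = set("MDN=X")
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_alignment(chrom, pos, cigar):
    """Yield (chrom, intron_start, intron_end) for each N operation.

    pos is the 1-based leftmost reference position of the alignment;
    the yielded intron coordinates are 1-based and inclusive.
    """
    ref = pos
    for length, op in _CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # The skipped region covers reference bases ref .. ref+length-1.
            yield (chrom, ref, ref + length - 1)
        if op in _REF_CONSUMING:
            ref += length


def aggregate_junctions(alignments, min_reads=2):
    """Count supporting reads per junction and keep well-supported ones.

    min_reads is an assumed, illustrative filter, not a value from the thesis.
    """
    counts = Counter()
    for chrom, pos, cigar in alignments:
        counts.update(junctions_from_alignment(chrom, pos, cigar))
    return {junction: n for junction, n in counts.items() if n >= min_reads}


# Toy input: two reads support the same chr1 intron, one read is unspliced,
# and one chr2 junction has only a single supporting read.
example_alignments = [
    ("chr1", 100, "20M500N30M"),
    ("chr1", 110, "10M500N40M"),
    ("chr1", 300, "50M"),
    ("chr2", 5, "10M100N10M"),
]
raw_junctions = aggregate_junctions(example_alignments, min_reads=2)
```

Raising `min_reads` trades sensitivity for specificity: low-depth junctions are dropped, which removes alignment artifacts but also rare genuine junctions.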
Keywords/Search Tags: RNA sequencing, splice junctions, sequence alignment, differentially expressed genes, alternative splicing