| With the rapid development of next-generation high-throughput DNA sequencing technologies in recent years, genomics and transcriptomics studies have received strong technical support and made unprecedented progress. RNA sequencing (RNA-seq), based on next-generation sequencing technologies, is rapidly becoming a standard and widely used technology for transcriptome analysis. RNA-seq enables scientists to study gene activity and its changes across different tissues, time courses and conditions. Compared with microarray technology, RNA-seq does not depend on existing gene information and can capture almost all expressed transcripts in an experiment, whereas microarrays need prior gene information to design probes and consequently cannot detect novel alternative splicing variants, transcripts or genes. In addition, RNA-seq has low background noise, a broader dynamic range of expression levels, and increased specificity and sensitivity. These advantages give RNA-seq strong potential to replace microarray technology for transcriptome analysis. In transcriptome studies, gene expression analysis mainly examines the expression levels of genes and their corresponding isoforms during transcription. It is of great significance for understanding gene regulation mechanisms and for preventing, diagnosing and treating various diseases. According to the workflow of RNA-seq data analysis, gene expression analysis comprises two research fields: expression level quantification and differential expression analysis. The main work of this thesis therefore focuses on these two fields, and the main contents are as follows. 1. Modeling base-level bias to estimate gene and isoform expression. Gene and isoform expression quantification is one of the most basic goals of RNA-seq data analysis, and it still poses challenges. In RNA-seq data, biases lead to
the non-uniform distribution of reads along expressed genes, which is a key factor affecting the accuracy of gene and isoform expression estimates. Many methods therefore model and correct this non-uniformity to improve estimation accuracy. We propose PBSeq, a Poisson model that uses a base-level bias correction strategy to estimate gene and isoform expression. PBSeq adopts a Poisson distribution to model the read count at each base. The bias correction strategy uses two nonparametric models to estimate the positional and sequence-specific biases at the starting positions of reads mapped to the genes of interest. The bias values are then incorporated into the Poisson distribution as weights. We use a simulated dataset and several real RNA-seq datasets to validate the PBSeq model. The results show that PBSeq estimates gene and isoform expression levels accurately and is computationally efficient compared with other state-of-the-art methods. PBSeq not only provides expression values but also estimates the uncertainty associated with them, which is useful for downstream analysis. We take differential expression analysis as an example to show how expression measurement uncertainty can improve downstream analysis. 2. Modeling exon-specific bias distribution to estimate gene and isoform expression. The variation patterns of read counts are similar across multiple samples for each individual gene. However, current methods usually process each sample separately and rarely account for this similarity. Based on this observation, we propose a Poisson-Gamma mixture model (PGSeq) to jointly estimate expression levels and exon-specific biases from multi-sample RNA-seq data. PGSeq adopts a Poisson distribution to model read counts and uses Gamma-distributed latent variables to model the read sequencing preference of each exon. These variables are shared across multiple samples and are embedded in the rate parameter of a Poisson
model to account for the overdispersion of the read distribution. We use several real datasets and one simulated dataset to evaluate PGSeq and compare its performance with other popular methods. The results show that PGSeq outperforms the alternatives in the accuracy of gene and isoform expression estimates and in downstream differential expression analysis. In particular, we show the advantage of our method in the analysis of low expression levels. 3. Considering expression measurement uncertainty to detect differential expression. Detecting differential expression is a fundamental objective of RNA-seq data analysis. Current methods rarely consider expression measurement uncertainty. Moreover, most methods can detect only differentially expressed genes, and few can detect differentially expressed isoforms. We therefore propose a Bayesian framework (BDSeq) to detect differentially expressed genes and isoforms while accounting for expression measurement uncertainty. BDSeq adopts two different strategies to integrate the expression measurement uncertainty, leading to two models: the basic model (BDSeqB) and the fast model (BDSeqF). Several real RNA-seq datasets are used to evaluate BDSeq, and the results show that including expression measurement uncertainty improves the accuracy of detecting differentially expressed genes and isoforms. BDSeqB obtains more accurate results than BDSeqF, whereas BDSeqF is markedly more computationally efficient than BDSeqB. 4. The analysis pipeline for RNA-seq data. To make our methods easier to use, we design a user-friendly pipeline for RNA-seq data (UFP-RSeq). The pipeline contains three modules (read alignment, expression level quantification and differential expression analysis) and can complete the gene expression analysis of RNA-seq data. The read alignment module adopts the widely used aligner Bowtie. The expression level quantification module includes three
of our methods: GamSeq, PBSeq and PGSeq. The differential expression analysis module consists of BDSeq and three count-based methods. Based on users' requirements and research goals, we provide suggestions to help them choose suitable methods and pathways. The code and documentation of all methods in UFP-RSeq are freely available at http://parnec.nuaa.edu.cn/liux/UFP-RSeq.html. In conclusion, this thesis focuses on two research fields of gene expression study: expression level quantification and differential expression analysis. For expression level quantification, we successively propose GamSeq, PBSeq and PGSeq to address the non-uniform read distribution problem. For differential expression analysis, we propose BDSeq to detect differentially expressed genes and isoforms while accounting for expression measurement uncertainty. The results show that our methods achieve better accuracy and computational efficiency. For the convenience of users, we design the UFP-RSeq analysis pipeline and provide corresponding advice to help users choose suitable methods and pathways. |
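
The idea of folding bias values into a Poisson model as weights, as described for PBSeq, can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes a single-isoform gene, bias weights that are already known, and a model in which the count at base i follows Poisson(w_i * theta). The function name `estimate_expression` and the example numbers are hypothetical. Under these assumptions the maximum-likelihood estimate of theta is the total count divided by the total weight, and an approximate standard error follows from the Fisher information.

```python
import math


def estimate_expression(counts, weights):
    """Estimate expression under counts[i] ~ Poisson(weights[i] * theta).

    Hypothetical helper for illustration only; the weights stand in for
    per-base bias values, which are assumed to be given.
    Returns (theta_hat, standard_error).
    """
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    total_count = sum(counts)
    # Setting the derivative of the log-likelihood to zero gives
    # theta_hat = sum(counts) / sum(weights).
    theta_hat = total_count / total_weight
    # Observed Fisher information at theta_hat is total_weight**2 / total_count,
    # so the approximate variance is total_count / total_weight**2.
    standard_error = math.sqrt(total_count) / total_weight
    return theta_hat, standard_error


# Made-up per-base read-start counts and bias weights for one toy gene.
theta, se = estimate_expression([4, 0, 6, 10], [0.5, 0.5, 1.0, 2.0])
print(f"theta = {theta:.3f}, standard error = {se:.3f}")
```

The standard error is the kind of expression measurement uncertainty that a downstream differential expression step could take into account. A multi-isoform gene would need a sum over isoform-specific rates, which this sketch does not cover.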