Study On Gene Non-coding Region And Motif Discovery Based On Bayesian Statistics

Posted on: 2015-03-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Liu
GTID: 1220330431462421
Subject: Applied Mathematics
Abstract/Summary:
Genome research has led to a rapid growth of genome sequencing data. Analyzing this huge amount of DNA sequence data has become one of the essential tasks for scientists. Identifying motifs in gene non-coding regions is the most challenging problem in this area, and an important one, because motif identification is the key to understanding the mechanism of gene transcription and expression. Bayesian statistics, which combines prior information with a posterior distribution, has been adopted by many biologists to deal with the huge amount of DNA data.

In this thesis, we focus on the sequence analysis of non-coding regions and on motif discovery in bioinformatics using Bayesian statistical methods. The main contributions are as follows:

1. To model non-coding background sequences, a method for analysing context dependency is developed based on Bayesian hypothesis testing. The data are modeled with a multinomial distribution whose prior is a Dirichlet distribution with Jeffreys hyper-parameters. The Bayesian hypothesis testing technique can be applied to discrete Markov chains to obtain a test for Markovianity. The advantage of choosing a higher-order Markov chain model is shown, and a method for selecting the proper order for non-coding background sequences is given. Context dependence of at least first order is found in ten gene groups of the yeast S. cerevisiae. Thus a higher-order Markov chain is more suitable for modeling non-coding background sequences than an independence model.

2. For significance testing of motifs in biological sequences, an improved Bayesian hypothesis test is presented. The test is converted into a goodness-of-fit test for the multinomial distribution. Based on Bayes' theorem, a Bayes factor is obtained, which serves as a statistical estimate of the significance. The method avoids the difficulty of constructing a test statistic and deriving its exact distribution under the null hypothesis. To estimate the parameters of the Dirichlet prior of the multinomial distribution, two methods are given: moment estimation, and maximum likelihood estimation that uses the Newton-Raphson algorithm to maximize the predictive distribution of the data. Using the Pearson product-moment correlation coefficient as an objective criterion of estimation quality, experimental results indicate that the Bayesian test performs better on average than the classical methods.

3. Motifs are commonly modeled using position frequency matrices. To compare position frequency matrices that represent binding sites, we propose to identify and group similar profiles using Bayesian hypothesis testing between position frequency matrices. We describe a column-by-column method for quantifying position frequency matrix similarity, based on the Bayes factor and the posterior probability of the null model that the aligned columns are independent and identically distributed observations from the same multinomial distribution. Experimental studies using both real promoter sequences and simulated data show that the method is very competitive with, and on average even better than, the other classical methods.

4. To exploit the dependency among base positions in binding sites as an aid to motif discovery, a new Bayesian scoring function and a Gibbs sampling algorithm are presented. Most of the available tools for predicting unknown binding sites assume independence between the base positions of a binding site. However, recent biological experiments suggest that there is interdependency among positions in binding sites. Thus, firstly, we extend the position weight matrix model to obtain the dinucleotide position frequency matrix, each entry of which gives the number of occurrences of a pair of nucleotides in a pair of positions. Secondly, we create the Bayesian scoring function, whose hyper-parameters are obtained by maximum likelihood estimation from the transcription factor binding site matrices contained in the JASPAR database. Finally, a greedy strategy is employed to choose the initial parameters of the dinucleotide position frequency matrix. A site sampler is used to find one occurrence of the motif per sequence in the dataset, searching for the alignment with the maximum score. We evaluate the new Bayesian scoring function on real and simulated datasets, and the results show that the proposed algorithm improves the discovery of unknown binding sites and performs better than some methods that do not consider dependency.
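
The column-by-column comparison of position frequency matrices described in item 3 can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a Dirichlet prior with Jeffreys hyper-parameters (1/2 for each of A, C, G, T), assumes the alternative model gives each column its own multinomial distribution, and the function names (log_bayes_factor, pfm_log_bayes_factor, posterior_null) are hypothetical. Under these assumptions the multinomial coefficients cancel and the Bayes factor reduces to a ratio of multivariate Beta functions.

```python
from math import exp, lgamma, log

# Assumption: Jeffreys prior Dirichlet(1/2, 1/2, 1/2, 1/2) over A, C, G, T.
JEFFREYS = (0.5, 0.5, 0.5, 0.5)


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def log_bayes_factor(x, y, alpha=JEFFREYS):
    """Log Bayes factor of H0 (both count columns share one multinomial)
    against H1 (each column has its own multinomial)."""
    a_x = [a + cx for a, cx in zip(alpha, x)]
    a_y = [a + cy for a, cy in zip(alpha, y)]
    a_xy = [a + cx + cy for a, cx, cy in zip(alpha, x, y)]
    # log P(x, y | H0) - log P(x | H1) - log P(y | H1)
    return log_beta(a_xy) + log_beta(alpha) - log_beta(a_x) - log_beta(a_y)


def pfm_log_bayes_factor(pfm_a, pfm_b, alpha=JEFFREYS):
    """Sum the per-column log Bayes factors of two aligned matrices.
    Each matrix is a list of (A, C, G, T) count columns of equal width."""
    if len(pfm_a) != len(pfm_b):
        raise ValueError("matrices must be aligned to the same width")
    return sum(log_bayes_factor(x, y, alpha) for x, y in zip(pfm_a, pfm_b))


def posterior_null(log_bf, prior_null=0.5):
    """Posterior probability of H0 from a log Bayes factor and a prior
    probability of H0 strictly between 0 and 1."""
    z = log_bf + log(prior_null) - log(1.0 - prior_null)
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)  # numerically stable branch for very negative z
    return e / (1.0 + e)


if __name__ == "__main__":
    motif_a = [(10, 0, 0, 0), (0, 10, 0, 0)]
    motif_b = [(9, 1, 0, 0), (0, 9, 1, 0)]
    lbf = pfm_log_bayes_factor(motif_a, motif_b)
    print(lbf, posterior_null(lbf))
```

A positive log Bayes factor favors the null model that the aligned columns come from the same distribution, so larger values indicate more similar profiles. Summing over columns treats the columns as independent, which is an assumption of this sketch.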
Keywords: Bioinformatics, Bayesian Statistics, Statistical significance, Similarity analysis, Motif identification