Font Size: a A A

Applications Of Higher Order Markov Model To Phylogeny Reconstruction And Motif Finding

Posted on:2017-04-04Degree:DoctorType:Dissertation
Country:ChinaCandidate:W F YangFull Text:PDF
GTID:1220330485964986Subject:Statistics
Abstract/Summary:PDF Full Text Request
Traditional methods of biological sequence analysis are based on the multiple sequence alignment. While the alignment approach has many limitations:There is no unified standard to choose a nucleotide or amino acid substitutions scoring matrix; Alignment is invalids for highly divergent sequences such as regulatory el-ements; Because of the high computation complexity, on the big data produced by those next-generation sequencing technologies, implementing sequence analysis method based on the multiple sequence alignment is impracticable. So in the post-gonomic era, there is urgent need of efficient alignment free methods for biological sequence analysis. Markov model is an important model to characterize stochastic process, it has a long history to be applied in biological sequence analysis. For ex-ample, Markov model was employed in many classical approach for identifying of CpG island and gene finding. However, the high order Markov model was rarely used in the past. In this thesis, we investigate the application of the high order Markov model in biological sequence analysis. The main results are the follows.1. Maximizing the Markov-Shannon entropy (MME) method for determination of the order of Markov model. The Markov model was widely used in biological sequence analysis, while the determination of the order was relatively rare explored. In general, the order was inferred by the x2-Statistic or Akaike information crite-rion (AIC) and Bayesian information criterion(BIC). Using higher order Model for comparison of biological sequences, we hope the information can be represented as much as possible. So we introduce the method of maximizing the Markov-Shannon entropy(MME). The test on multiple data sets shows the order identified by MME is higher than those by AIC and BIC. Moreover, the MME method has advantage over the AIC and BIC in the case of using high order Markov model for comparison of biological sequences.2. One-dimensional chaos game representation. The chaos game representation introduced by Jeffrey is a two dimensional graphic and one-to-one representation for DNA sequences, which is based on a function iteration system. It turns a DNA se-quence to a scatter graph in a unit square. By this way, in a sequence, the frequency specificity of all kind of polymer with different length is indicated by the density in the different subarea, and the base composition bias in different levels is revealed by the fractal characteristics of the scatter graph. The chaos game representation has been widely used in characterization of DNA sequence. However, the chaos game representation is a graphic representation tailed for DNA sequences, it is suitable for a sequence on an alphabet with k2 symbols at most, but an amino acid sequence has 20 different residues. One-dimensional chaos game representation is another one-to-one representation method based on a function iteration system similar as J-effrey’s chaos game representation, which can turn a sequence on any alphabet with finite symbol to a number sequence in a unit interval of one dimensional real axis. This representation is suitable for not only a DNA/RNA sequence, but all so an amino acid sequence, and even for English text sequence that consist of 26 letters. Except for the visualization, the one dimensional chaos game representation inherit all merits of the two dimensional chaos game representation. The inversion formula which turns a number in the unit interval to the related prefix sequence and the structural index representing k-string is introduced, and the relationship between this representation and the higher order Markov model is investigated. The critical tasks of applying higher order Markov models is to estimate the order and the large scale conditional probabilities. These properties of the one-dimensional chaos game representation contribute to complete this two tasks.3. Phylogeny reconstruction. The traditional approach for phylogeny recon-struction using biological sequences is gene alignment. According to a nucleotide or amino acid substitutions scoring matrix, the evolutionary distance is obtained from multiple sequence alignment, and under the molecular clock hypothesis, the phylogenetic tree can be reconstructed. Those genes are relative more conserved, for example,16S rRNA,18S rRNA. However, in general, the different gene trees are not consistent. Moreover, those alignment based approaches have their limi-tations. Then many efficient alignment free methods for phylogeny reconstruction were introduced. The widely used CVTree approach, uses frequencies of all A;-string with fixed length as feature vector to characterize genome or proteome sequence, where the background frequencies are produced from the high order Markov mod-el. Inspired by this, we employ high order Markov model to directly characterize genome or proteome sequence, and use the transition matrix as feature vector to characterize whole genome or proteome sequence. Where the order of the Markov model is determined by the new MME method. The results on multiple data sets show that the present method is a very effective.4. Motif finding. The gene is the elementary unit of inherited information in DNA sequence, while the gene transcription and the gene expression are regulated by the transcription factors, which activate or inhibit the transcription machinery by binding regulatory element such as enhancer, promoter and silencer sequences. The binding positions are relative fixed and recurrent sequence pattern, which is called motif. Understanding the mechanisms that regulate gene expression is a ma-jor challenge in biology. Identifying regulatory elements, especially the binding sites in DNA for transcription factors is a major task in this challenge. Inspired by the method introduced by Tompa et al., here, we introduce a new A;-string based method using high order Markov model. The background sequence set is characterized by high order Markov model. Under the high order Markov model of background, the expected frequency of each A;-string is determined, and then the relative deviation rate of each fc-string is obtained from the actual frequency and the expected fre-quency. Finally, from the relative deviation rate, we can infer a k-string is a motif sample or is from the random background. The results on multiple HT-SELEX data set imply the effectivity of the new method for Motif finding.
Keywords/Search Tags:chaos game representation, high order Markov model, one-dimentional chaos game representation, phylogenetic tree, Motif finding
PDF Full Text Request
Related items