Several Probabilistic Models For Analysis Of Biological Sequences And Their Applications

Posted on: 2012-08-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G S Chang
Full Text: PDF
GTID: 1220330368985845
Subject: Applied Mathematics
Abstract/Summary:
Advances in biological sequencing technologies and the implementation of the Human Genome Project have generated an overwhelming amount of sequence data. This large amount of data raises new questions, such as how to analyze, process, and store it, which pose serious challenges to computer science, mathematics, and related fields. Computational biology has therefore emerged as a new and developing interdisciplinary field, concerned mainly with analyzing the information contained in biological sequences. Traditional methods based on multiple sequence alignment are not suitable for large sequence data sets because of fundamental and computational limitations, such as the difficulty of searching for optimal solutions. Consequently, considerable effort has gone into researching alternative, alignment-free methods for sequence comparison, and the alignment-free methods proposed in recent years have become an important topic in computational biology. This study addresses the following research objectives, each involving a probabilistic model used in computational biology.

In chapter 2, we propose the weighted relative entropy, based on a Markov chain model of DNA sequences. Because a Markov chain is characterized by its transition probability matrix and its initial probability distribution, the weighted relative entropy is derived from two transition probability matrices and two initial probability distributions. We take the weighted relative entropy as a simple measure of distance between genomic sequences, and we validate this measure by applying it to similarity search. The weighted relative entropy is also shown to serve as an alternative method for rapidly constructing phylogenetic trees of 48 HEV genome sequences.

In chapter 3, a conditional multinomial distribution model is constructed from the inter-nucleotide distance sequence. The relative error vector derived from the conditional multinomial distribution can then be used as a genomic signature that identifies each species. This approach allows comparative analysis between complete genome sequences. Specifically, we propose a new representation of evolutionary information, the κ-multinomial composition vector (κ-MCV). Based on the κ-multinomial composition vector, we introduce the conditional multinomial complete composition vector. The proposed method is tested by phylogenetic analysis on twenty-four coronavirus genomes. Our results demonstrate that the new method is powerful and efficient.

In chapter 4, we introduce a new sequence distance for efficient reconstruction of phylogenetic trees, based on the length distribution of common subsequences between two protein sequences. Intuitively, the longer the common string shared by two sequences, the more similar the two sequences are. The length distribution of common subsequences is derived by considering all common subsequences of the two sequences. To measure how well this length distribution extracts information from a pair of sequences, we tested the method by phylogenetic analysis on 24 transferrin sequences from vertebrates. The results demonstrate that the new method can extract more information from protein sequences.
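The chapter 2 idea can be illustrated with a short Python sketch. The abstract does not give the dissertation's exact formula, so the weighting here is an assumption: each row of the transition matrix contributes its relative entropy weighted by the initial probability of that row's nucleotide, and the relative entropy of the two initial distributions is added. The function names, the pseudocount smoothing, and the symmetrization into a distance are likewise hypothetical choices made for illustration only.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def markov_model(seq, pseudocount=1.0):
    """Estimate a first-order Markov chain from a DNA string.

    Returns (initial, transition), where initial[a] is the smoothed
    frequency of nucleotide a and transition[a][b] is the smoothed
    probability that b follows a. Symbols outside ACGT are dropped
    before counting (a simplification: neighbours of a dropped symbol
    are then treated as adjacent).
    """
    symbols = [c for c in seq.upper() if c in ALPHABET]
    singles = Counter(symbols)
    pairs = Counter(zip(symbols, symbols[1:]))
    k = len(ALPHABET)

    total = len(symbols) + pseudocount * k
    initial = {a: (singles[a] + pseudocount) / total for a in ALPHABET}

    transition = {}
    for a in ALPHABET:
        row_total = sum(pairs[(a, b)] for b in ALPHABET) + pseudocount * k
        transition[a] = {
            b: (pairs[(a, b)] + pseudocount) / row_total for b in ALPHABET
        }
    return initial, transition


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) for dicts over the same keys."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)


def weighted_relative_entropy(seq_p, seq_q):
    """ASSUMED form: D(init_p || init_q) + sum_a init_p[a] * D(P[a] || Q[a])."""
    init_p, trans_p = markov_model(seq_p)
    init_q, trans_q = markov_model(seq_q)
    rows = sum(
        init_p[a] * relative_entropy(trans_p[a], trans_q[a]) for a in ALPHABET
    )
    return relative_entropy(init_p, init_q) + rows


def distance(seq_a, seq_b):
    """Symmetrized version, so that distance(a, b) equals distance(b, a)."""
    return 0.5 * (
        weighted_relative_entropy(seq_a, seq_b)
        + weighted_relative_entropy(seq_b, seq_a)
    )
```

Calling `distance(s, t)` for every pair in a set of genomes would give a distance matrix that a standard tree-building method such as neighbor joining could consume. The pseudocount keeps every probability positive, so the logarithms stay finite even when a dinucleotide never occurs in one of the sequences.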
Keywords/Search Tags: Markov model, Relative entropy, Weighted relative entropy, Geometric distribution, Conditional multinomial distribution, Longest common substring, Harmonic distribution of common substring