
Several Models From K-word Of Non-frequencies On DNA Sequences Comparison And Their Applications

Posted on: 2014-11-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X W Yang
Full Text: PDF
GTID: 1260330425477247
Subject: Applied Mathematics
Abstract/Summary:
The enormous volume of biomolecular sequence data calls for efficient computational methods of biological sequence analysis. The typical approach to sequence comparison is alignment. However, alignment-based methods have difficulty handling large databases and choosing scoring schemes. To overcome these two critical limitations, many alignment-free methods have been proposed, among which the model of k-word frequencies may be the most well developed.
Most alignment-free methods based on k-words are biased toward individual k-words: they treat word frequencies as separate, discrete units, in spite of their correlations and bulk property. In addition, the distance spaces are determined by the number and diversity of the sequences. As a result, it is very hard to know which value is the smallest nonzero distance, and so the degree of similarity between two sequences cannot be judged from a single distance. To address these two problems, in chapter 2 we proposed a novel statistical distance for sequence comparison on the basis of k-word counts. For a fixed k, the relationships among all 4^k k-word counts were considered in each sequence, and the sequence of k-word rank orders was regarded as a new feature of the whole genomic sequence. This new distance removed the influence of sequence length and uncovered the bulk property of k-words in DNA sequences. The proposed distance was tested by similarity search and phylogenetic analysis, and the experimental assessment demonstrated that it was efficient.
Starting from differences in sequence length, both k-word based methods and graphical representation approaches have uncovered biological information in their own distinct ways. However, it is unlikely that the mechanisms of information storage vary with sequence length, so a similarity distance suitable for sequences of various lengths should be much closer to those mechanisms. In chapter 3, we established a novel alignment-free sequence comparison method suitable for biological sequences of various lengths. New sub-sequences of k-words were extracted from the biological sequences under a one-to-one mapping and evaluated by a linear regression model. A new distance was then defined on the invariants of the linear regression model. In comparison with other alignment-free distances, the results of four experiments on sequences of different lengths demonstrated that our distance was more efficient.
Although many alignment-free similarity distances have been proposed on the basis of k-words, most of them consider only k-word counts and ignore k-word position information. In fact, k-word positions carry important information about gene rearrangement, inversion, transposition, and translocation. In chapter 4, a similarity distance combining the positional information and the bulk property of k-words was proposed. Phylogenetic analysis of 3 data sets with this distance demonstrated its efficiency.
Last but not least, for each similarity distance proposed in the three chapters, further analysis revealed the optimal k-word length for each data set.
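The abstract does not give formulas for the proposed distances, so the following Python sketch only illustrates the raw ingredients it names: the 4^k k-word counts of a sequence, the rank orders of those counts, and the k-word start positions. The function names, the average-rank treatment of ties, and the `rank_distance` function are assumptions made for illustration. In particular, `rank_distance` is a simple stand-in and not the statistical distance defined in the dissertation. Sequences are assumed to be uppercase strings over A, C, G, T.

```python
from collections import Counter
from itertools import product


def kword_counts(seq, k):
    """Count overlapping k-words in seq for all 4^k words over A, C, G, T."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Words containing other symbols (e.g. N) are ignored; unseen words get 0.
    return {w: observed.get(w, 0) for w in words}


def rank_orders(counts):
    """Rank words by descending count (1-based); tied words share the average rank."""
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and counts[ordered[j + 1]] == counts[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for w in ordered[i:j + 1]:
            ranks[w] = average_rank
        i = j + 1
    return ranks


def kword_positions(seq, k):
    """Map each observed k-word to the list of its 0-based start positions."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return positions


def rank_distance(seq_a, seq_b, k):
    """Hypothetical stand-in (NOT the dissertation's distance):
    mean absolute rank difference, divided by the number of words 4^k."""
    ranks_a = rank_orders(kword_counts(seq_a, k))
    ranks_b = rank_orders(kword_counts(seq_b, k))
    n = len(ranks_a)
    return sum(abs(ranks_a[w] - ranks_b[w]) for w in ranks_a) / (n * n)
```

Because ranks depend only on the ordering of the counts and not on their magnitudes, two sequences of different lengths yield rank vectors on the same scale, which is one plausible reading of how a rank-based feature could reduce the influence of sequence length.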
Keywords/Search Tags:Alignment-free, Sequence Comparison, Sequence Similarity Search, Phylogenetic Analysis, Phylogenetic Tree Construction