
The Alignment-free Methods And Their Applications For Analysis Of Biological Sequences

Posted on: 2009-12-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Z Liu
Full Text: PDF
GTID: 1100360272970739
Subject: Applied Mathematics
Abstract/Summary:
With the rapid development of mathematics and computer technology and the continuous accumulation of biological data, a new and active interdisciplinary field, computational molecular biology, has come into being. It has attracted many computer scientists, molecular biologists, and mathematicians, and is mainly concerned with the computationally complex problems that arise in biological applications. Biological sequence analysis is central to the field. Traditional methods of analysis are chiefly based on sequence alignment, but with the arrival of the "post-genome" era, alignment-free methods, which complement and extend alignment methods, have become an active research area in computational molecular biology. In this dissertation, we first briefly review alignment methods, then summarize alignment-free methods fairly systematically and propose several new ones, and finally apply the new methods to the analysis of sequences from a number of species. The main contents of this dissertation are as follows:

Based on vectors of L-tuple probabilities for biological sequences, we propose a novel distance measure, the normalized Euclidean distance, and use it to classify two sets of protein sequences, CK35 and SP86, according to protein secondary structure. We further compare our method with other metrics and with alignment methods via ROC (receiver operating characteristic) analysis in order to assess the intrinsic ability of the methodology to discriminate and classify biological sequences and structures.

Using L-tuples, we construct three 8-component vectors and multivariate vectors for a DNA primary sequence, and obtain a set of related matrices by varying the start position of the sliding window. The normalized leading eigenvalues and Frobenius norms of the constructed matrices are selected as numerical characterizations. As applications, we compare the similarities and dissimilarities of exon 1 of the β-globin genes of eleven species; we simulate the search for sequences similar to a query sequence in a database of 39 library sequences using the multivariate vector representations of DNA sequences; and we reconstruct phylogenetic trees of H5N1 avian influenza virus genomes.

From the frequency and positions of occurrence of an L-tuple in a biological sequence, we construct a characteristic distribution of the L-tuple to reflect the biological information contained in the sequence. The utility of the approach is illustrated by graphs of the characteristic distributions of the dinucleotide GC for the coding sequences of the first exon of the β-globin gene of eleven species, and by phylogenetic trees constructed for twenty-four coronavirus genomes, thirty-four mitochondrial genomes, and 40 G protein-coupled receptors.
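As a rough illustration of the L-tuple probability vectors mentioned above, the following Python sketch counts overlapping L-tuples in a sequence and compares two sequences by the plain Euclidean distance between their probability vectors. The abstract does not give the normalization used in the dissertation's normalized Euclidean distance, so the sketch makes no attempt to reproduce that measure. The function names `ltuple_probabilities` and `euclidean_distance` and the default DNA alphabet are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
import math


def ltuple_probabilities(seq, L, alphabet="ACGT"):
    """Return the relative frequencies of all |alphabet|**L L-tuples in seq.

    Tuples are listed in lexicographic order of the alphabet. Windows that
    contain symbols outside the alphabet are ignored (an assumption).
    """
    tuples = ["".join(p) for p in product(alphabet, repeat=L)]
    # Slide a window of length L over the sequence, one position at a time.
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = sum(counts[t] for t in tuples)
    if total == 0:
        raise ValueError("sequence contains no valid L-tuples")
    return [counts[t] / total for t in tuples]


def euclidean_distance(p, q):
    """Plain (unnormalized) Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the dissertation.
    p = ltuple_probabilities("ACGTACGTAC", 2)
    q = ltuple_probabilities("AAGGCCTTAA", 2)
    print(euclidean_distance(p, q))
```

For protein sequences, the same functions could be called with a 20-letter amino acid alphabet, although the vector length grows as 20 to the power L.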
Keywords/Search Tags:L-tuple, Distance measure, ROC curve, Mitochondrial genome, Coronavirus, Avian influenza virus, Transmembrane proteins, Neighbor-joining method, Phylogenetic tree