
Mathematical Description Of The Biological Macromolecules And Its Applications

Posted on: 2007-07-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Li
GTID: 1100360182982426
Subject: Computational Mathematics
Abstract/Summary:
With the completion or continued development of the genome projects for human and several model organisms, the focus of biology has shifted from accumulating biological data to analyzing and interpreting them. Bioinformatics, also called computational molecular biology, has thus emerged as a new and developing interdisciplinary field. Its research area is very wide and includes sequence comparison, computational gene recognition, molecular evolution and comparative genomics, RNA and protein structure prediction, codon origin and evolution of the genetic code, assembly of contigs, structure-based drug design, and so on. Most of these topics share a common requirement: the biological data must first be translated into some mathematical description. The mathematical description of biological macromolecules is therefore a basic but very important topic in bioinformatics.

The main contents of this thesis are as follows.

In Chapter 1, we propose three kinds of graphical representations for biological sequences from different points of view. Firstly, we introduce a 3-D graphical representation of DNA primary sequences by taking four special vectors in 3-D space to represent the four nucleic acid bases A, G, C, and T, respectively. Secondly, based on the characteristic sequences of a DNA primary sequence, we introduce two 2-D graphical representations of DNA sequences: one is the "two horizontal lines" graph, and the other is the "ladder-like" graph. Each reflects the sequence structure as well as the chemical structure of DNA. Finally, we introduce a directed graphical representation of biological sequences, which not only overcomes a serious drawback of the existing graphical representations, but also provides a new way of characterizing bio-sequences numerically.

In Chapter 2, we propose a new sequence invariant named the "ALE-index", which is based on norms of a matrix. The ALE-index can be regarded as an approximation of the leading eigenvalue, currently the most widely used invariant. Unlike the leading eigenvalue, the ALE-index is very simple to calculate, so it can be applied directly to long biological sequences. This makes it practicable to compare whole genomes by the invariant-based sequence comparison method. Meanwhile, we find that the information reflected by the leading eigenvalue alone might not be comprehensive in a special case. In that case, we suggest using the so-called "pseudo-trace" instead of the leading eigenvalue to characterize DNA sequences. Moreover, we describe a scheme that transforms the directed graph of a biological sequence into an upper triangular matrix, and investigate whether the existing sequence invariants are compatible with the upper triangular matrix representation. Finally, to reflect the information on the elements of a sequence and, especially, the order relation among them, we construct a chain (totally ordered set) from a sequence of numbers, and then introduce the normalized relative-entropy. We briefly discuss a potential application: using a 12-component vector, based on the normalized relative-entropy associated with a DNA sequence, to discriminate protein coding from non-coding sequences in the yeast genome.

In Chapter 3, based on the idea of homomorphism in algebra, we describe a DNA sequence by coarse graining and propose the logical representation (LR) of DNA primary sequences. Furthermore, we present a generalized LZ complexity for (0,1)-sequences. An examination of the similarity among the DNA sequences of the full β-globin genes of 11 species shows the utility of our approach. We also generalize the logical representation from DNA primary sequences to protein primary sequences. A similarity and dissimilarity analysis based on the normalized relative-entropy of the logical sequences of proteins is given for eight protein sequences. In addition, we introduce the shadow sequence for RNA secondary structure. By combining it with the symbolic sequence complexity, we compare the RNA secondary structures of nine viruses.

In the last chapter, based on the normalized relative-entropy of DNA sequences, we use the Fisher discriminant method to find protein coding genes in the yeast genome. Cross-validation tests demonstrate that the accuracy of the algorithm is 96%. The total number of protein coding genes in the yeast S. cerevisiae genome is estimated to be at most 5873, consistent with the widely accepted range of 5800-6000.
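To make the coarse-graining idea of Chapter 3 concrete, the following is a minimal sketch, not the thesis's own method. It maps a DNA string to a (0,1)-sequence using the purine/pyrimidine split (an assumed, illustrative choice of coarse graining) and then counts the components of the classical Lempel-Ziv (1976) exhaustive history. The thesis's generalized LZ complexity is not specified in this abstract and is not reproduced here; the names `PURINE_MAP`, `coarse_grain`, and `lz76_complexity` are hypothetical.

```python
# Assumed coarse graining: purines (A, G) -> "1", pyrimidines (C, T) -> "0".
# This is one illustrative choice, not necessarily the mapping used in the thesis.
PURINE_MAP = {"A": "1", "G": "1", "C": "0", "T": "0"}


def coarse_grain(dna: str, mapping: dict = PURINE_MAP) -> str:
    """Translate a DNA string into a (0,1)-sequence, base by base."""
    return "".join(mapping[base] for base in dna.upper())


def lz76_complexity(s: str) -> int:
    """Count the components in the classical Lempel-Ziv (1976) parsing of s.

    Each component is the shortest substring starting at the current
    position that has not already appeared starting at an earlier
    position (overlap with the current component is allowed). A final
    component that runs off the end of s is still counted.
    """
    n = len(s)
    start = 0   # index where the current component begins
    count = 0
    while start < n:
        length = 1
        # Grow the component while it can be copied from earlier text.
        # Restricting the search to s[0 : start + length - 1] forces any
        # match to begin strictly before `start`.
        while (start + length <= n
               and s.find(s[start:start + length], 0, start + length - 1) != -1):
            length += 1
        count += 1
        start += length
    return count


if __name__ == "__main__":
    binary = coarse_grain("ATGGTGCACC")
    print(binary, lz76_complexity(binary))
```

A lower component count means the sequence is more repetitive. Comparing such counts across sequences is one simple way to turn a coarse-grained representation into a similarity measure.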
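The Fisher discriminant step in the last chapter can be illustrated with a small two-class sketch. Everything below is an assumption for illustration. The thesis uses a 12-component vector built from the normalized relative-entropy, which this abstract does not define, so the example uses made-up 2-D feature vectors. The midpoint threshold between the projected class means is also just one common choice. The function names (`fit_fisher`, `predict`, and the helpers) are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(rows: Sequence[Sequence[float]]) -> Vector:
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter(rows: Sequence[Sequence[float]], mu: Vector) -> List[Vector]:
    """Within-class scatter matrix: sum of outer products of deviations."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a: List[Vector], b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_fisher(pos, neg) -> Tuple[Vector, float]:
    """Return (w, c): direction w = Sw^{-1} (mu_pos - mu_neg), threshold c."""
    mu_p, mu_n = mean(pos), mean(neg)
    sp, sn = scatter(pos, mu_p), scatter(neg, mu_n)
    sw = [[a + b for a, b in zip(rp, rn)] for rp, rn in zip(sp, sn)]
    w = solve(sw, [p - q for p, q in zip(mu_p, mu_n)])
    # Assumed decision rule: midpoint between the projected class means.
    c = 0.5 * (dot(w, mu_p) + dot(w, mu_n))
    return w, c


def predict(w: Vector, c: float, x: Sequence[float]) -> bool:
    """True means 'classified as the positive (e.g. coding) class'."""
    return dot(w, x) > c


if __name__ == "__main__":
    # Toy, made-up feature vectors standing in for coding / non-coding samples.
    coding = [(2.0, 2.1), (2.5, 1.9), (3.0, 2.6), (2.2, 2.8)]
    noncoding = [(0.0, 0.2), (0.5, -0.1), (-0.3, 0.4), (0.2, 0.9)]
    w, c = fit_fisher(coding, noncoding)
    print(w, c, predict(w, c, (2.4, 2.0)))
```

In a setting like the one the abstract describes, the rows would be feature vectors of known coding and non-coding sequences, and accuracy would be estimated by cross-validation rather than on the training data.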
Keywords/Search Tags: Bioinformatics, Biological macromolecule, DNA, RNA, Protein, Graphical representation, Numerical characterization, Logical sequence, Shadow sequence, Sequence complexity, Sequence comparison, Gene recognition