
Research On DNA Signal Sequences Analysis For Gene Prediction

Posted on: 2011-02-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Guo
Full Text: PDF
GTID: 1100360308969781
Subject: Communication and Information System
Abstract/Summary:
Biotechnology is one of the most promising scientific fields of the 21st century. Bioinformatics is dedicated to interpreting genomic information, exploring hidden patterns in the genome and, ultimately, understanding life and its processes comprehensively. The key to gene prediction is to interpret and understand the genome sequence, namely to identify all functional units in the genome, including protein-coding DNA fragments and other functional units. Because of biodiversity and the large variation in gene structure, existing recognition algorithms have many problems with accuracy, computational load and scope of application. To address these problems, three aspects are studied:

1. Research on splice site prediction. The recognition of splice sites is an important step in gene prediction. Because the Takagi-Sugeno (T-S) fuzzy model generalizes well, is robust and has a simple structure, a T-S modeling algorithm based on least squares and fuzzy clustering with a fuzzy likelihood function is proposed. A modeling method that classifies sequences by GC content (high or low) is presented. It exploits the relationship between the conserved signal sequences around splice sites and the statistical observation that the composition of the sequences upstream and downstream of a splice site depends on the GC content of the surrounding sequence. This improves the identification accuracy. To further improve the identification accuracy and reduce the computational complexity, an improved naive Bayesian splice site classifier is proposed that uses both the composition and the position of the bases in the sequence. Based on kernel method theory, the method uses a Bayesian feature function to map the sequences into a new feature space. A linear relationship between the condition attributes and the decision attribute is derived, and its coefficients are determined by the least squares method, which yields a new Bayesian classifier.
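The Bayesian feature mapping described for the splice site classifier can be illustrated with a minimal sketch. It assumes the feature function is a per-position log-likelihood ratio between true and false sites, which the abstract does not state explicitly. The toy windows, the pseudocount and all function names are hypothetical, and the least squares weighting step is left out: the features are simply summed, which is the plain naive Bayes decision rather than the improved classifier.

```python
import math

BASES = "ACGT"

# Hypothetical toy windows; the true sites carry a GT dinucleotide at positions 2-3.
POS_WINDOWS = ["AGGTAA", "CGGTGA", "AGGTAG", "TGGTAA"]
NEG_WINDOWS = ["AGCTAA", "CCGAGA", "TTGCAG", "AGATCA"]


def positional_log_odds(pos_seqs, neg_seqs, pseudocount=1.0):
    """Build table[i][b] = log(P(b at i | site) / P(b at i | non-site))."""
    length = len(pos_seqs[0])
    table = []
    for i in range(length):
        row = {}
        for b in BASES:
            # Smoothed relative frequency of base b at position i in each class.
            p = (sum(s[i] == b for s in pos_seqs) + pseudocount) / (
                len(pos_seqs) + len(BASES) * pseudocount)
            q = (sum(s[i] == b for s in neg_seqs) + pseudocount) / (
                len(neg_seqs) + len(BASES) * pseudocount)
            row[b] = math.log(p / q)
        table.append(row)
    return table


def feature_map(seq, table):
    """Map a sequence to one log-odds feature per position."""
    return [table[i][b] for i, b in enumerate(seq)]


def naive_bayes_score(seq, table):
    """Unweighted sum of the features (equal class priors assumed)."""
    return sum(feature_map(seq, table))


table = positional_log_odds(POS_WINDOWS, NEG_WINDOWS)
for window in ("AGGTGA", "ACCTCA"):
    print(window, round(naive_bayes_score(window, table), 3))
```

A positive score favors a true site and a negative score favors a false site. In the method summarized above, the unit weights of this sum would instead be coefficients fitted by least squares.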
Simulation results show that the computation time of this classifier is directly proportional to the number of sequences and that the method has high classification accuracy. Its performance is improved compared with SVM-B and the naive Bayesian classifier. The method is well suited to gene structure identification in large DNA sequence data sets.

2. Research on accurate localization of protein coding regions. The recognition of protein coding regions is an important research subject in gene prediction. An integrated algorithm for exon identification is proposed. First, based on the conserved sequences of DNA coding regions, a support vector machine (SVM) classifier is built that recognizes the first nucleotide of a codon in coding regions. Then, exploiting the period-3 behavior of the first nucleotide of a codon, the output sequence of the model is analyzed with the short-time Fourier transform, so that the position of coding regions can be accurately determined. Because gene structure is complex and diverse, identification accuracy can be improved by dividing the positions of bases in a gene into three classes. A binary SVM classifier cannot recognize these positions well, and a multi-class SVM has a complicated structure, so a T-S fuzzy model is used to model the gene sequence. Its single output indicates whether the nucleotide at the center of the input window belongs to a non-coding region, is the first nucleotide of a codon in a coding region, or is a codon nucleotide other than the first in a coding region. The output sequence of the model is then analyzed with the short-time Fourier transform, and the position of coding regions can be accurately determined.

3. Research on human promoter prediction. The recognition of eukaryotic promoters is a difficult research subject in gene prediction. A promoter recognition algorithm based on a model of the positional densities of oligonucleotides is proposed.
First, a Gaussian mixture model (GMM) is adopted to model the positional densities of oligonucleotides and to extract important motifs that play a role in signal regulation. The expectation-maximization (EM) algorithm is used to estimate the parameters of the GMM. To improve the modeling accuracy, the optimal number of mixture components and the initial means are determined by fuzzy clustering. Based on the known oligonucleotide positional densities, a weighted Bayesian classifier based on least squares is built to identify human promoters. Its computational cost is small, which makes it suitable for large DNA sequence data sets. To exploit the signal features of promoters and improve identification accuracy and efficiency, the original promoter DNA sequences are projected into the high-dimensional space of oligonucleotide positional densities using a Bayes feature mapping, and a least squares support vector machine (LSSVM) is built on a new kernel function corresponding to this mapping; human promoters are then identified by the LSSVM. Through this kernel, both the content and the position information of oligonucleotides are integrated, which reflects the actual transcriptional regulation mechanism well. These prediction methods can be generalized to several other biological problems. The algorithm generalizes well, and its computational cost is insensitive to the input dimension of the samples.

Finally, the research work of the thesis is summarized, and directions for future work are pointed out.
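The GMM step in part 3 can be illustrated with a minimal sketch of EM for a one-dimensional Gaussian mixture fitted to the positions at which one oligonucleotide occurs. The positions are synthetic, the function names are hypothetical, and the number of components and the initial means are supplied by hand here, whereas the thesis determines them by fuzzy clustering.

```python
import math

# Synthetic occurrence positions of one oligonucleotide, relative to an
# assumed transcription start site at 0 (illustrative values only).
POSITIONS = [-32, -31, -30, -30, -29, -28, -2, -1, 0, 0, 1, 2]


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, init_means, n_iter=50, min_var=1e-3):
    """Fit a 1-D GMM by EM; returns (weights, means, variances)."""
    k = len(init_means)
    n = len(xs)
    means = list(init_means)
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    variances = [max(overall_var, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each position.
        resp = []
        for x in xs:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)  # floor guards against collapse
    return weights, means, variances


def mixture_density(x, weights, means, variances):
    """Positional density of the oligonucleotide at position x."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


weights, means, variances = fit_gmm_1d(POSITIONS, init_means=[-25.0, -5.0])
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

The fitted density gives, for each position, how likely the oligonucleotide is to occur there. Densities of this kind are what the weighted Bayesian classifier and the kernel described above take as features.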
Keywords/Search Tags: Bioinformatics, Genome Prediction, Takagi-Sugeno Fuzzy Model, Naive Bayesian Classification, Support Vector Machine, Gaussian Mixture Model, Short Time Fourier Transform