
Recognition Of The Splice Sites And Analysis Of Gene Expression Based On Machine Learning Theory

Posted on: 2012-11-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Q Su
Full Text: PDF
GTID: 1220330335455540
Subject: Communication and Information System
Abstract/Summary:
Bioinformatics is an interdisciplinary field that combines molecular biology and computer science. It involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, protein structures, and gene expression data. The large volume and high dimensionality of bioinformatics data create a critical need for theoretical, algorithmic, and software advances in storing, retrieving, processing, analyzing, and visualizing biological information. Computational algorithms have become an essential component of the research process. This is due to the inherent complexity of biological systems, brought about by evolutionary tinkering, and to our lack of a comprehensive theory of life's organization at the molecular level. Machine learning approaches (e.g. neural networks, hidden Markov models, support vector machines, belief networks) are ideally suited to domains characterized by large amounts of data, "noisy" patterns, and the absence of general theories. The aim of this thesis is to improve accuracy and efficiency by modifying existing machine learning algorithms according to the statistical properties of bioinformatics data. The thesis comprises four parts:

(1) Improvement of the self-organizing feature map (SOFM). A SOFM transforms a high-dimensional input space onto a lower-dimensional (usually one- or two-dimensional) discrete map while maintaining the original similarity relations. SOFMs have demonstrated several beneficial features that make them a valuable tool in pattern discovery, data analysis, and related tasks. When the neuron weights are updated, the Kohonen learning algorithm is controlled by two learning parameters, the learning coefficient and the width of the neighborhood function, which have to be chosen empirically because neither rules nor a method exists for calculating them.
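The conventional Kohonen update described above can be sketched as follows. This is a minimal illustration of the standard algorithm on a one-dimensional map, not the method of the thesis. The names `train_sofm` and `best_matching_unit`, the parameters `lr0` and `sigma0`, and the exponential decay schedule are all assumptions made for the example; the hand-picked schedule is exactly the kind of empirical choice the text refers to.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the index of the node whose weight vector is closest to x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda j: sq_dist(weights[j]))


def train_sofm(data, n_nodes=10, epochs=50, lr0=0.5, sigma0=3.0, seed=0):
    """Train a one-dimensional SOFM with the classic Kohonen rule.

    lr0 and sigma0 are the initial learning coefficient and neighborhood
    width. Both decay exponentially here; that schedule is an assumption
    chosen for illustration, not a value taken from the thesis.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    samples = list(data)  # shuffle a copy so the caller's data is untouched
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * math.exp(-3.0 * frac)
        sigma = max(sigma0 * math.exp(-3.0 * frac), 1e-3)
        rng.shuffle(samples)
        for x in samples:
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighborhood measured on the 1-D map lattice.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights
```

Calling `train_sofm(data)` on a list of equal-length feature vectors returns the trained weight vectors. Note that `lr0`, `sigma0`, and the decay rate still have to be found by trial and error.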
To avoid tuning these parameters by hand, a novel method is incorporated into the learning algorithm that adjusts the learning coefficient and the width of the neighborhood function with an unscented Kalman filter (UKF) and a Kalman filter (KF), respectively.

(2) Application of kernel methods. Kernel methods, which generalize linear learning methods to non-linear ones, have become a cornerstone of much recent work in machine learning and have been used successfully for many core machine learning tasks such as clustering, classification, and regression. In practice, kernel methods depend on an appropriate kernel function and parameters, which should be chosen according to the statistical properties of the data. For serial analysis of gene expression (SAGE) data, which follow a Poisson distribution, a Poisson-model-based kernel (PMK) is proposed.

(3) Recognition of the splice sites. In eukaryotic cells, most genes are interrupted by introns that must be removed before the genetic information can be decoded. RNA polymerase does not discriminate these introns from coding regions (exons), since they are normally transcribed together as a common precursor mRNA (pre-mRNA). Splicing, the process that removes introns from a pre-mRNA, is probably the most important post-transcriptional step in determining the protein output of a gene, so intron-exon boundaries have to be precisely defined. The SOFM, with its parameters adjusted by UKF and KF, was applied to the Homo Sapiens Splice Sites Dataset (HS3D).

(4) Analysis of the SAGE data. Support Vector Machines (SVM) and Kernel Principal Component Analysis (KPCA) are two kernel-method algorithms. SVM is built upon the structural risk minimization principle from statistical learning theory, which states that the generalization error of a learning machine is bounded by both the empirical risk and a confidence interval.
KPCA is an efficient generalization of traditional Principal Component Analysis (PCA) that allows for the detection and characterization of low-dimensional nonlinear structure in multivariate data sets. A PMK-based SVM and a PMK-based KPCA were used to analyze SAGE data.
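The KPCA step can be sketched in a few lines. The abstract does not give the form of the PMK, so this example substitutes a Gaussian (RBF) kernel as a stand-in; any positive semidefinite kernel function could be passed in its place. The function names, the `gamma` value, and the use of power iteration to find only the leading component are illustrative assumptions, not the procedure used in the thesis.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel, used here only as a stand-in for the PMK."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def centered_gram(data, kernel):
    """Build the Gram matrix and center it in feature space."""
    n = len(data)
    K = [[kernel(xi, xj) for xj in data] for xi in data]
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    # K is symmetric, so the column means equal the row means.
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def first_kernel_component(Kc, iters=500):
    """Leading eigenpair of the centered Gram matrix by power iteration."""
    n = len(Kc)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    eigval = 0.0
    for _ in range(iters):
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
        eigval = norm  # converges to the largest eigenvalue
    return eigval, v


def project_training_points(eigval, v):
    """Scores of the training points on the first kernel principal component.

    With the eigenvector scaled so that the feature-space axis has unit
    length, the score of training point i is sqrt(eigval) * v[i].
    """
    scale = math.sqrt(eigval)
    return [scale * c for c in v]
```

For a small data set, `centered_gram(data, rbf_kernel)` followed by `first_kernel_component` and `project_training_points` gives one score per sample. Samples that are close under the kernel receive similar scores.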
Keywords/Search Tags: Machine Learning, Bioinformatics, Gene Expression, Splice Sites