Statistical models for the analysis of heterogeneous biological data sets

Posted on:2004-11-27

Degree:Ph.D

Type:Thesis

University:University of Pennsylvania

Candidate:Buehler, Eugen Christian

Full Text:PDF

GTID:2460390011475841

Subject:Computer Science

Abstract/Summary:

The focus of this thesis is on developing methods of integrating heterogeneous biological feature sets into structured statistical models, so as to improve model predictions and further understanding of the complex systems that they emulate. Combining data from different sources is an important task in genomics because of the increasing variety of large-scale data being generated, all of which reflect different components of the same complicated network of biological interactions that make up an organism. We contend that traditional machine learning techniques are too general to accurately model heterogeneous biological data and provide insufficient feedback to researchers concerning the systems being modeled. In contrast, we will show that interpretable statistical models specifically designed for and inspired by the underlying structure of biological problems yield more accurate predictions and provide valuable insight into biological systems.

Toward proving this thesis, we introduce maximum entropy biological sequence models. Maximum entropy sequence models have been used previously to integrate arbitrary features in other (non-biological) domains, such as natural language modeling. Here, we apply the same model structure to amino acid and nucleotide sequences. We first propose a broad variety of biologically inspired features, define them mathematically, and test their ability to improve models of amino acid sequences. Of these features, particular attention is paid to long distance features such as triggers, which incorporate information unavailable to more conventional Markovian models and reflect the non-local nature of protein sequence constraints. The ability of these features to improve gene-finding models is demonstrated. We next extend maximum entropy models to nucleotide coding sequences and apply them to the detection of lateral gene transfer. This allows us to evaluate a diverse set of features in a statistically rigorous manner, improving understanding of the problem and eliminating the tendency to inaccurately label short genes. We also develop methods for integrating positional and gene expression data with our maximum entropy sequence model, allowing more accurate predictions of lateral gene transfer and resulting in significant biological insight.
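
To make the idea of a maximum entropy sequence model with a long distance trigger feature concrete, here is a minimal sketch. It is not the thesis's implementation: the feature definitions, the helper names (`unigram_feature`, `trigger_feature`, `maxent_prob`), and the weights are all hypothetical and chosen only for illustration. It assumes the standard conditional maximum entropy form, p(a | history) = exp(sum_i w_i f_i(history, a)) / Z(history), and one plausible reading of a trigger: a feature that fires when a given residue has already appeared anywhere earlier in the sequence, however far back.

```python
import math

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def unigram_feature(target):
    """Hypothetical local feature: fires whenever the next residue is `target`."""
    def f(history, a):
        return 1.0 if a == target else 0.0
    return f


def trigger_feature(trigger, target):
    """Hypothetical long distance feature: fires when the next residue is
    `target` and `trigger` occurred anywhere earlier in the sequence."""
    def f(history, a):
        return 1.0 if a == target and trigger in history else 0.0
    return f


def maxent_prob(history, features, weights, alphabet=AMINO_ACIDS):
    """Conditional maximum entropy distribution over the next residue:
    p(a | history) = exp(sum_i w_i * f_i(history, a)) / Z(history)."""
    scores = {
        a: math.exp(sum(w * f(history, a) for f, w in zip(features, weights)))
        for a in alphabet
    }
    z = sum(scores.values())  # normalizer Z(history)
    return {a: s / z for a, s in scores.items()}


# Illustrative (not fitted) weights: a mild preference for leucine, and a
# trigger that makes cysteine more likely once a cysteine has been seen.
features = [unigram_feature("L"), trigger_feature("C", "C")]
weights = [0.5, 1.2]

p_with = maxent_prob("MKCA", features, weights)     # history contains C
p_without = maxent_prob("MKAA", features, weights)  # history lacks C
```

In this sketch the trigger raises the probability of cysteine only when the history already contains one, which is information a fixed-order Markov model would lose once the earlier residue falls outside its context window. In practice the weights would be estimated from training sequences rather than set by hand.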

Keywords/Search Tags:

Biological, Models, Maximum entropy, Data, Sequence

Related items

1	The Principle Of Maximum Entropy And Minimum Entropy Methods In The Measurement Data Processing Applications
2	The Life Distribution Class Based On Entropy And Maximum Dynamic Entropy Models
3	Research On Key Issues Of Bayesian Maximum Entropy Spatiotemporal Prediction And Its Application
4	Research On Several Kinds Of Entropy Of Continuous Map
5	Maximum entropy estimation of seemingly unrelated regression and its application to Chinese household expenditure survey data
6	Alignment-free Sequence Similarity Analysis And Clustering Algorithms On Biological Sequences
7	Based On Constrained Choice Probability Density Function Of The Maximum Entropy Method To Estimate
8	Study On Multivariate Maximum Entropy Models And Their Applications In Coastal And Ocean Engineering
9	Research On Spatial Data Interpolation Modeling Based On Linear Maximum Entropy Principle
10	Data Information Pattern Recognition Theory And Its Applications