System and tools to support a Bayesian approach to improving large-scale metabolic models

Posted on: 2009-01-15
Degree: Ph.D
Type: Dissertation
University: The University of Chicago
Candidate: Shi, Xinghua
Full Text: PDF
GTID: 1440390005460739
Subject: Computer Science
Abstract/Summary:
With hundreds to thousands of sequenced genomes rapidly becoming available, the construction of genome-scale metabolic models for these organisms has attracted much attention. Although current genome and pathway databases provide a large proportion of the metabolic information that can be used directly to build metabolic models, a number of problems still introduce network holes and thus leave these models incomplete. Network holes occur when the network is disconnected and certain metabolites cannot be produced or consumed. Several factors can lead to network holes, such as missing genes, incorrect or missing annotations, and poor mappings from functions to biochemical reactions.

To date, the construction of genome-scale metabolic models is still dominated by manual searches for candidates to fill network holes. Because this manual work is time-consuming and labor-intensive, only two dozen such models have been published. To construct genome-scale metabolic models for the hundreds to thousands of organisms available, computational approaches are needed to accelerate the model-building process.

Toward the automatic reconstruction of metabolic models, we propose STeAM, a system and tools to support a Bayesian approach to improving genome-scale metabolic models. An infrastructure that incorporates all the computational tools is built to enable experiments with computational tools and schemes in STeAM. First, a set of tools is designed to integrate and reconcile data from a variety of databases: a genomic database, the SEED; a genomic and pathway database, KEGG; and a database of published genome-scale metabolic models, BiGG. Next, network connectivity is analyzed and network holes are detected.

With the aim of filling network holes, data from these databases are organized, computed, and processed to prepare for the construction of reaction predictors that can generate candidate hole-filling reactions.
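
The definition of a network hole given above (a metabolite that cannot be produced or cannot be consumed) can be illustrated with a small sketch. This is not the STeAM implementation: the function name `find_network_holes`, the tuple layout for reactions, the `external` set of exchanged metabolites, and the toy network are all assumptions made for illustration.

```python
def find_network_holes(reactions, external=frozenset()):
    """Return (never-produced, never-consumed) metabolites.

    reactions: dict of reaction id -> (substrates, products, reversible)
    external:  metabolites assumed to be exchanged with the environment,
               so they are not reported as holes (an assumption of this sketch)
    """
    produced, consumed = set(), set()
    for substrates, products, reversible in reactions.values():
        consumed.update(substrates)
        produced.update(products)
        if reversible:
            # A reversible reaction can also run in the opposite direction.
            consumed.update(products)
            produced.update(substrates)

    metabolites = produced | consumed
    not_produced = {m for m in metabolites
                    if m not in produced and m not in external}
    not_consumed = {m for m in metabolites
                    if m not in consumed and m not in external}
    return not_produced, not_consumed


# Hypothetical toy network: A -> B -> C, plus a disconnected reaction D -> E.
toy = {
    "R1": ({"A"}, {"B"}, False),
    "R2": ({"B"}, {"C"}, False),
    "R3": ({"D"}, {"E"}, False),
}
not_produced, not_consumed = find_network_holes(toy, external={"A"})
print("never produced:", sorted(not_produced))
print("never consumed:", sorted(not_consumed))
```

In this toy case D is flagged because no reaction produces it, and C and E are flagged because no reaction consumes them. A real model would also need to treat biomass components and exchanged metabolites explicitly.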
In total, 23 types of evidence are extracted from the SEED, KEGG, and BiGG. This topological and biological evidence can be categorized as follows. (i) At the gene level, three types of evidence are collected from 560 complete genomes in the SEED: gene co-occurrence, gene co-occurrence in gene clusters, and the co-occurrence of gene-gene pairs in gene clusters. (ii) At the reaction level, ten types of evidence are collected: reaction priors and reaction co-occurrence in each of five data resources. These five data resources are the reconstructed iJR904 and iSB619 models in BiGG, the reference pathway map in KEGG, the network modules of KEGG, 736 organism maps in KEGG, and 560 draft models in the SEED. (iii) At the segment level, segment priors and the co-occurrence of reaction-segment pairs are extracted from the same five data resources as at the reaction level. After the evidence is obtained from existing databases, 23 individual predictors are created to use it, based on Bayesian approaches. Then, to combine these individual predictors and unify their predictions, an ensemble of individual predictors is built using majority vote and four classifiers: Naive Bayes Classifier, Bayesian Network, Multilayer Perceptron Network, and AdaBoost.

Three sets of experiments are performed to train and test the individual predictors and the mechanisms that integrate them, and ultimately to evaluate the performance of the system and computational tools. The first set checks the self-consistency of the two reconstructed models, iJR904 and iSB619, through "Knockout and Recover" of the core metabolic subnetwork. The second set examines how the deletion of different parts of a model, either totally at random or based on connected subgraphs of the model, affects the recovery ability of the computational methods. The third set uses a new genome-scale metabolic model for C. acetobutylicum as a test model, improving its draft model from the SEED. The thorough analysis of various data and the new results gained from the experiments not only provides insight into the properties of metabolic networks but also reveals the meanings of, and relationships among, different data entities. Moreover, this newly discovered knowledge can be fed back to existing data resources and enhance our current knowledge of genome annotations and metabolic models.
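
The majority-vote ensemble and the "Knockout and Recover" experiments described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the three predictors are hand-written stand-ins rather than the Bayesian predictors of the dissertation, the reaction identifiers are hypothetical, and scoring recovery by precision and recall is a choice made here, not one stated in the abstract.

```python
def majority_vote(votes):
    """True if strictly more than half of the predictors vote yes."""
    return sum(votes) * 2 > len(votes)


def knockout_and_recover(model, knocked_out, universe, predictors):
    """Delete reactions from a model, then try to recover them by voting.

    model, knocked_out, universe: sets of reaction ids
    predictors: callables mapping a reaction id to True/False
    """
    remaining = model - knocked_out
    candidates = universe - remaining          # everything not in the model
    recovered = {r for r in candidates
                 if majority_vote([p(r) for p in predictors])}
    hits = recovered & knocked_out
    precision = len(hits) / len(recovered) if recovered else 0.0
    recall = len(hits) / len(knocked_out) if knocked_out else 0.0
    return recovered, precision, recall


# Hypothetical data: a four-reaction model inside a six-reaction universe.
model = {"R1", "R2", "R3", "R4"}
universe = model | {"R5", "R6"}
knocked_out = {"R2", "R3"}

# Stand-in predictors (assumptions); each accepts a fixed set of reactions.
predictors = [
    lambda r: r in {"R2", "R3", "R5"},
    lambda r: r in {"R2", "R5"},
    lambda r: r in {"R2", "R3"},
]

recovered, precision, recall = knockout_and_recover(
    model, knocked_out, universe, predictors)
print(sorted(recovered), round(precision, 3), recall)
```

Here both deleted reactions are recovered, along with one false positive (R5) that two of the three stand-in predictors accept. Replacing the vote with a trained classifier over the predictor outputs would correspond to the other ensemble schemes the abstract lists.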
Keywords/Search Tags:Metabolic models, Data resources, Tools, Network holes, Bayesian, KEGG, Improving, Individual predictors