Learning from protein structure related data

Posted on: 2007-02-08
Degree: Ph.D
Type: Dissertation
University: Temple University
Candidate: Peng, Kang
Full Text: PDF
GTID: 1440390005971214
Subject: Computer Science
Abstract/Summary:
The three-dimensional (3-D) structure of a protein can provide valuable insights into its biological functions. However, due to limitations in current technology, only a small proportion of known proteins have had their structures experimentally determined. Therefore, computational approaches that learn from protein structure related data to predict structure from amino acid sequence are becoming increasingly attractive.
The first part of this dissertation addresses the sample selection bias problem in current protein structure data, i.e. proteins with experimental structures are not representative of all natural proteins. A contrast classifier framework was first proposed for detecting and characterizing such bias in a general machine learning context. It was then applied to explore bias in two protein structure related databases: the Protein Data Bank (PDB) of experimental protein structures and the TargetDB database of structural genomics (SG) targets. The results indicated that the contrast classifier could be a useful tool for understanding the bias in current protein structures and for improving target selection and prioritization for structural genomics projects.
The second part of this dissertation examines a special case of learning from protein structure related data, i.e. prediction of intrinsically disordered regions. Here, intrinsically disordered regions refer to protein sequence regions that lack stable 3-D structures under physiological conditions but still carry out important biological functions. Four VL3 predictors were first developed for prediction of long disordered regions (>30 residues). By incorporating evolutionary information and using optimized predictor models, the VL3 predictors achieved significantly higher prediction accuracy than previous long disorder predictors. However, they were significantly less accurate on short disordered regions (≤30 residues) due to a length-dependent heterogeneity in amino acid compositions. To address this problem, the VSL2 predictors were developed by using a meta predictor to combine two specialized predictors optimized for short and long disordered regions, respectively. Experimental evaluation showed that the VSL2 predictors achieved well-balanced accuracy on both types of disordered regions and were significantly more accurate than several existing predictors.
As the final part of this dissertation, an iterative procedure was proposed for efficient learning of neural-network-ensemble predictors from arbitrarily large datasets; it could potentially be useful in learning more accurate protein structure predictors.
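The abstract says only that a meta predictor combines a short-region and a long-region disorder predictor; it does not give the combination rule. The sketch below illustrates one plausible reading, assuming the meta predictor emits a per-residue weight in [0, 1] that is used for a weighted average of the two specialized scores. The function names, the weighting formula, and the 0.5 threshold are all hypothetical and are not taken from the dissertation.

```python
from typing import List, Sequence


def combine_with_meta(
    short_scores: Sequence[float],
    long_scores: Sequence[float],
    meta_weights: Sequence[float],
) -> List[float]:
    """Blend per-residue disorder scores from two specialized predictors.

    ASSUMPTION: meta_weights[i] is how strongly the (hypothetical) meta
    predictor favors the long-region predictor at residue i.
    """
    if not (len(short_scores) == len(long_scores) == len(meta_weights)):
        raise ValueError("all inputs must have one value per residue")
    combined = []
    for s, l, w in zip(short_scores, long_scores, meta_weights):
        if not 0.0 <= w <= 1.0:
            raise ValueError("meta weights must lie in [0, 1]")
        # Weighted average: w -> long predictor, (1 - w) -> short predictor.
        combined.append(w * l + (1.0 - w) * s)
    return combined


def call_disorder(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Label a residue as disordered when its score reaches the threshold.

    ASSUMPTION: 0.5 is an illustrative cutoff, not a value from the source.
    """
    return [score >= threshold for score in scores]
```

With a weight of 0 the combined score equals the short-region score, and with a weight of 1 it equals the long-region score, so each specialized predictor dominates where the meta predictor trusts it most.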
Keywords/Search Tags: Protein, Structure, Data, Predictors, Disordered regions