Classification and knowledge discovery in protein databases

Posted on: 2005-03-19
Degree: Ph.D
Type: Dissertation
University: Temple University
Candidate: Radivojac, Predrag
Full Text: PDF
GTID: 1450390011450950
Subject: Computer Science
Abstract/Summary:
One of the major objectives of bioinformatics in the post-genomic era is the automated characterization of the large number of available protein sequences. The ultimate goal of such characterization is a detailed understanding of protein function and its complex network of interactions with other molecules in biochemical pathways. In this study we addressed several issues frequently encountered in classification and knowledge discovery in protein databases and took a further step in the characterization and prediction of intrinsically disordered proteins.

First, we concentrated on the problem of classification in noisy, high-dimensional, sparse, and class-imbalanced datasets. Restricting ourselves to the two-class classification framework, we put emphasis on the cases where one class (the positive or minority class) is underrepresented and small, while the other class (the negative or majority class) is arbitrarily large. We designed a complete classification system that includes a permutation-test-based feature selection filter and then combines over-sampling of the minority class, under-sampling of the majority class, and ensemble learning to address noise and class imbalance. The best overall method was then combined with clustering and estimation of a priori class probabilities from unlabeled data into a unified system for prediction on large protein databases.

Second, we studied statistical properties of protein data belonging to low-B-factor ordered regions, high-B-factor ordered regions, short intrinsically disordered regions, and long intrinsically disordered regions. We provided evidence that all four groups are distinct types of protein flexibility, with the low-B-factor ordered regions being considerably different from the remaining three groups. Furthermore, the amino acid compositions of the four groups are all distinct, and the differences are not merely quantitative variations along a continuum. Based on these differences, a predictor of high-B-factor ordered regions was constructed.

Third, in addition to ordered and disordered regions, we also studied boundary regions between ordered and long disordered regions. We found specific amino acid signals that are characteristic of the boundary regions and subsequently built a predictor of order/disorder boundaries. This predictor was then combined with a standard order/disorder predictor into a preliminary boundary-augmented model.

Finally, we studied amino acid substitution patterns of intrinsically disordered proteins and constructed a new scoring system, i.e., a scoring matrix and gap penalties, that improves sequence alignments of intrinsically disordered proteins.
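
To make the idea of a permutation-test-based feature selection filter more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the dissertation's implementation: the test statistic (absolute difference of class means), the significance threshold `alpha`, the number of permutations, and the function names `permutation_p_value` and `select_features` are all hypothetical choices made for this example.

```python
import random


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Permutation p-value for one feature in a two-class problem.

    Assumed statistic: absolute difference between the class means.
    Labels are 1 (positive/minority) or 0 (negative/majority).
    """
    rng = random.Random(seed)

    def mean_difference(lbls):
        pos = [v for v, y in zip(values, lbls) if y == 1]
        neg = [v for v, y in zip(values, lbls) if y == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

    observed = mean_difference(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling labels breaks any feature/class association
        # while keeping the class sizes (and the imbalance) fixed.
        rng.shuffle(shuffled)
        if mean_difference(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


def select_features(rows, labels, alpha=0.05, n_permutations=1000, seed=0):
    """Return indices of features whose permutation p-value is below alpha."""
    selected = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if permutation_p_value(column, labels, n_permutations, seed) < alpha:
            selected.append(j)
    return selected


if __name__ == "__main__":
    # Toy imbalanced data: 10 positives, 30 negatives.
    labels = [1] * 10 + [0] * 30
    # Feature 0 tracks the class; feature 1 is constant (uninformative).
    rows = [[float(y), 0.5] for y in labels]
    print(select_features(rows, labels, n_permutations=200))
```

Because the filter scores each feature independently of any classifier, it could be run before the over-sampling, under-sampling, and ensemble stages described above; how the original system ordered or tuned these steps is not stated in the abstract.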
Keywords/Search Tags: Protein, Class, Regions