Active learning for the prediction of asparagine and aspartate hydroxylation sites on human proteins

Posted on: 2012-04-06
Degree: M.A.Sc
Type: Thesis
University: Carleton University (Canada)
Candidate: Iyuke, Festus Omonigho
GTID: 2458390011951888
Subject: Engineering
Abstract/Summary:
This thesis reports on the development and evaluation of a pool-based active learning approach for building support vector machine (SVM) classifiers that predict asparagine/aspartate (N/D) hydroxylation sites on human proteins. Verifying hydroxylation sites on human proteins in wet-lab experiments is costly and can be time-consuming. Active learning can therefore be used to choose which putative hydroxylation sites should be sent for wet-lab validation so that each experiment yields maximal information. Using a dataset of N/D sites with known hydroxylation status, we demonstrate through simulations that active learning query strategies achieve higher classification performance with fewer labelled training instances than traditional passive learning. The query strategies (uncertainty, density-uncertainty, certainty) are shown to identify the most informative unlabelled instances for annotation by an oracle at each learning cycle. Our experimental results also show that active learning strategies are highly robust to class imbalance in the available training data.

Because the simulations clearly demonstrated the advantage of active learning for this application, certainty-based and uncertainty-based strategies were applied to select the 20 most informative putative N/D hydroxylation sites from the 1.3 million putative N/D hydroxylation sites in the entire human proteome. Owing to experimental limitations, only two of the proteins containing these sites were successfully isolated, quantified, and overexpressed in mammalian cells in an in vitro experiment. The biological activity of these proteins was verified using Western blotting, immunoprecipitation, and Coomassie stain analysis based on the protein expression identified on an SDS-PAGE gel. The successful identification of these proteins' overexpression on the gel lays the foundation for determining the true annotation of these putative N/D hydroxylation sites via mass spectrometry. Following the active learning algorithm, the classification of these new N/D sites will ultimately be used to further increase the prediction accuracy of the SVM-based classification model.
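
The pool-based loop described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes a small linear SVM trained by Pegasos-style subgradient descent as a stand-in for the SVM classifier, uses two-dimensional toy points in place of real N/D site features, and shows only the uncertainty strategy, taking "most uncertain" to mean "closest to the separating hyperplane". All function names (`train_linear_svm`, `most_uncertain`, `active_learning`) and parameter values are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def augment(x):
    # Append a constant 1 so the bias is learned as an ordinary weight.
    return list(x) + [1.0]

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style subgradient descent on the hinge loss; labels are +1/-1."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            xi = augment(X[i])
            violated = y[i] * dot(w, xi) < 1.0
            w = [(1.0 - eta * lam) * wj for wj in w]   # regularisation shrink
            if violated:                                # hinge-loss subgradient
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
    return w

def decision_value(w, x):
    return dot(w, augment(x))

def most_uncertain(w, X, pool):
    # Uncertainty query: the pool instance closest to the decision boundary.
    return min(pool, key=lambda i: abs(decision_value(w, X[i])))

def active_learning(X, oracle, seed_indices, budget):
    """Each cycle: retrain, query the oracle for one instance, move it out of the pool."""
    labels = {i: oracle(i) for i in seed_indices}
    pool = [i for i in range(len(X)) if i not in labels]
    for _ in range(min(budget, len(pool))):
        idx = sorted(labels)
        w = train_linear_svm([X[i] for i in idx], [labels[i] for i in idx])
        query = most_uncertain(w, X, pool)
        labels[query] = oracle(query)      # stands in for the wet-lab annotation
        pool.remove(query)
    idx = sorted(labels)
    w = train_linear_svm([X[i] for i in idx], [labels[i] for i in idx])
    return w, labels

# Toy data: two Gaussian clusters stand in for hydroxylated / non-hydroxylated sites.
rng = random.Random(1)
X = [[rng.gauss(2, 0.5), rng.gauss(2, 0.5)] for _ in range(20)] + \
    [[rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)] for _ in range(20)]
y_true = [1] * 20 + [-1] * 20

# Start from one labelled example per class, then spend a budget of 5 queries.
w, labels = active_learning(X, lambda i: y_true[i], seed_indices=[0, 20], budget=5)
```

In this sketch the oracle is a simple lookup into known labels, which mirrors the simulation setting of the abstract. In the wet-lab setting each oracle call would correspond to an experimental validation. A certainty-based strategy could be sketched by swapping `min` for `max` in `most_uncertain`, though the thesis's exact definitions may differ.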
Keywords/Search Tags: Active learning, Sites, Prediction, Human, Proteins