Study On Protein Phosphorylation Prediction Algorithms Based On Multiple Features

Posted on: 2015-01-03 | Degree: Master | Type: Thesis
Country: China | Candidate: W W Fan
GTID: 2250330431950003 | Subject: Biomedical engineering
Abstract/Summary:
As one of the most crucial post-translational modifications, reversible protein phosphorylation regulates various biological processes in many eukaryotes. It has been vividly described as the molecular switch of cellular activity, regulating almost all life processes, such as cell growth, development and apoptosis. In-depth study of the mechanism of phosphorylation and its impact on protein function is therefore a research direction well worth exploring in modern biology.
Protein phosphorylation sites can be identified by experimental techniques or by computational methods. Common experimental techniques include 32P radioactive labeling and mass spectrometry. These techniques are labor-intensive and time-consuming, so applying them to every protein sequence in a proteome is not feasible. At the same time, the identified sites they have accumulated provide data for bioinformatics, and computational methods for phosphorylation site prediction have developed rapidly in recent years. Methods that mine known phosphorylation data have been proposed to predict unknown phosphorylation sites, and such computational predictions can in turn guide experimental work.
This study employs machine learning methods to predict potential phosphorylation sites, conducting a systematic, hierarchy-specific investigation of protein phosphorylation site prediction. First, protein kinases are organized into a hierarchical structure following the classification proposed by Manning, with four levels: group, family, subfamily and kinase. Protein sequences are then taken from the Phospho.ELM database, and their corresponding kinases are mapped onto the hierarchy to form several protein kinase datasets at the different levels.
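The construction of level-specific datasets described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the kinase names, the hierarchy labels and the site records are hypothetical placeholders, and the function name `build_level_datasets` is an assumption introduced here. Real labels would come from the Manning classification and real sites from Phospho.ELM.

```python
from collections import defaultdict

# The four hierarchy levels named in the abstract.
LEVELS = ("group", "family", "subfamily", "kinase")

# Hypothetical toy hierarchy: kinase -> (group, family, subfamily).
# Placeholder labels only; not taken from the Manning classification.
HIERARCHY = {
    "KinaseX": ("GroupA", "FamilyA1", "SubfamilyA1a"),
    "KinaseY": ("GroupA", "FamilyA1", "SubfamilyA1b"),
    "KinaseZ": ("GroupB", "FamilyB1", "SubfamilyB1a"),
}

# Hypothetical annotated sites: (protein_id, position, kinase).
SITES = [
    ("P1", 15, "KinaseX"),
    ("P2", 88, "KinaseY"),
    ("P3", 42, "KinaseZ"),
    ("P4", 7, "KinaseQ"),  # kinase missing from the hierarchy
]


def build_level_datasets(sites, hierarchy):
    """Group annotated sites by kinase label at each hierarchy level.

    Returns {level: {label: [(protein_id, position), ...]}}.
    Sites whose kinase is not in the hierarchy are skipped.
    """
    datasets = {level: defaultdict(list) for level in LEVELS}
    for protein_id, position, kinase in sites:
        if kinase not in hierarchy:
            continue
        group, family, subfamily = hierarchy[kinase]
        labels = dict(zip(LEVELS, (group, family, subfamily, kinase)))
        # The same site contributes to one dataset per level.
        for level, label in labels.items():
            datasets[level][label].append((protein_id, position))
    return {level: dict(by_label) for level, by_label in datasets.items()}


datasets = build_level_datasets(SITES, HIERARCHY)
for level in LEVELS:
    sizes = {label: len(rows) for label, rows in datasets[level].items()}
    print(level, sizes)
```

Each site appears once per level, so a group-level dataset pools the sites of all kinases beneath it, while a kinase-level dataset holds only that kinase's sites.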
Next, functional features of proteins, namely gene ontology annotations and protein-protein interactions, are extracted from the Gene Ontology and STRING databases, respectively, to construct feature sets for phosphorylation site prediction. Because the gene ontology and protein-protein interaction feature spaces are very high-dimensional, a feature selection method based on mRMR, called "two-step sequential forward selection", is proposed to select effective features, and an optimal feature subset is extracted for each protein kinase. On this basis, random forest is employed to build classification models for predicting potential phosphorylation sites. Results on Phospho.ELM data, obtained with ten-fold cross-validation and additional testing, demonstrate that the proposed method remarkably outperforms existing phosphorylation site prediction methods at all hierarchical levels. In particular, even when the false positive rate is held at 1% or 5%, the method still achieves high prediction accuracy on positive data. Finally, to make the method easier to use, we implement a corresponding prediction tool to provide guidance and assistance for related research.
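The evaluation at a controlled false positive rate can be illustrated with a short sketch. It assumes each candidate site receives a score from a classifier (for example, a random forest's predicted probability) and reports the fraction of true sites recovered once the threshold is set so that at most 1% or 5% of negatives are called positive. The function name `sensitivity_at_fpr` and the score values are hypothetical, not results from the thesis.

```python
def sensitivity_at_fpr(pos_scores, neg_scores, max_fpr):
    """Return (threshold, sensitivity) with the false positive rate capped.

    A site is predicted positive when its score is strictly greater than
    the threshold. The threshold is the lowest negative score that keeps
    the number of false positives at or below max_fpr * len(neg_scores).
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of negatives allowed above the threshold (small epsilon
    # guards against floating-point error in the product).
    allowed_fp = int(max_fpr * len(neg_sorted) + 1e-9)
    if allowed_fp < len(neg_sorted):
        threshold = neg_sorted[allowed_fp]
    else:
        threshold = float("-inf")
    true_positives = sum(score > threshold for score in pos_scores)
    return threshold, true_positives / len(pos_scores)


# Hypothetical scores: 100 negatives spread evenly, 5 positives.
neg_scores = [i / 100 for i in range(100)]
pos_scores = [0.99, 0.97, 0.96, 0.90, 0.50]

for max_fpr in (0.01, 0.05):
    threshold, sensitivity = sensitivity_at_fpr(pos_scores, neg_scores, max_fpr)
    print(f"FPR <= {max_fpr:.0%}: threshold={threshold}, sensitivity={sensitivity}")
```

Using a strict comparison against a score drawn from the negatives guarantees the cap holds even when several negatives share the same score.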
Keywords/Search Tags: phosphorylation, protein kinase, functional feature, feature selection, random forest, site prediction