Study On The Prediction Of Pupylation Sites Based On Semi-supervised Learning

Posted on: 2021-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Liu
GTID: 2370330647454910
Subject: Computer application technology
Abstract/Summary:
Pupylation is a protein post-translational modification found in prokaryotes, similar to ubiquitination in eukaryotes. During this process, prokaryotic ubiquitin-like protein (PUP) recognizes substrate proteins in cells and modifies specific lysine residues under enzymatic catalysis. Recent research shows that pupylation is closely related to the pathogenicity of some pathogenic bacteria, and understanding its mechanism can help in treating the diseases these bacteria cause. An essential first step in studying pupylation is to identify the substrate proteins of PUP and the modifiable sites on these proteins, so the prediction of pupylation sites has become a key step. Identifying pupylation sites by biological experiments is time-consuming, laborious, and not always successful. Prediction by computational methods has therefore become an important supplement to experimental research.

Several computational tools for predicting pupylation sites have been developed, but their prediction performance is limited by four problems: a small positive sample set, an unreliable negative sample set, an unbalanced training set, and reliance on a single feature extraction method. Combining multiple sequence features and constructing a more reliable, larger, and balanced training set is therefore an important direction for improving prediction performance.

In this study, we propose a new prediction algorithm for pupylation sites based on semi-supervised learning strategies. First, six feature extraction methods are used to transform each amino acid sequence into a corresponding feature vector. Then, with the extracted feature vectors as input, a reliable training set construction algorithm based on K-means clustering is applied. In this algorithm, K-means clustering extracts high-density clusters from the positive sample set and from the unlabeled sample set separately. The Synthetic Minority Oversampling Technique (SMOTE) is applied to the high-density clusters of positive samples to increase the number of positive samples, and the Spy technique is used to filter the high-density clusters of unlabeled samples to construct a reliable negative sample set. Together, these steps adjust the ratio of positive to negative samples while increasing the number of positive samples and the reliability of the negative samples, which yields a reliable, balanced training set. Finally, this set is used to train a random forest model, which serves as the final prediction model for identifying pupylation sites.

We evaluate the algorithm on the constructed training set and an independent test set. The results show that the proposed training set construction algorithm effectively improves prediction performance. Compared with other prediction algorithms, the proposed algorithm improves several performance metrics, especially accuracy and the Matthews correlation coefficient, which reflect overall prediction performance. The comparison experiments also show that the proposed training set construction algorithm is more effective than other solutions to the class-imbalance problem in pupylation site prediction. The proposed algorithm is also suitable for predicting sites of other types of protein post-translational modification, especially when only a small number of known modification sites is available.
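The oversampling step mentioned in the abstract can be illustrated with a minimal SMOTE-style sketch. This is not the thesis implementation: the function name `smote`, its parameters, and the toy 2-D vectors are assumptions made for illustration, and real feature vectors from the six encodings would have many more dimensions. The sketch only shows the core idea of creating synthetic positive samples by interpolating between a positive sample and one of its nearest positive neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    (Hypothetical helper for illustration, not the thesis code.)
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D vectors standing in for encoded positive sites (made-up data).
positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(positives, n_synthetic=6, k=2, seed=42)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point is a convex combination of two existing positive samples, it stays inside the region already occupied by the positive class. Restricting the input to high-density clusters, as the abstract describes, would keep the interpolation away from outlying samples.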
Keywords/Search Tags: Semi-supervised learning, Protein post-translational modification, Pupylation, Functional site prediction, Bioinformatics