Prediction of Several Protein Post-translational Modification Sites Based on Ensemble Learning

Posted on: 2022-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Wei
Full Text: PDF
GTID: 2480306317968529
Subject: Statistics
Abstract/Summary:
Protein post-translational modifications (PTMs) are important regulators of protein function. They are chemical modifications of proteins coordinated by PTM enzymes, and they play a key role in many physiological processes. Nearly 200 different types of PTMs have been identified, and more than half of eukaryotic proteins are post-translationally modified at some point in their life cycle. Among the common amino acids that make up protein sequences, cysteine and lysine are frequently modified residues. At present, protein modification sites can be identified experimentally through proteomics methods, but the time required and the cost of the instruments remain a challenge. It is therefore necessary to develop computational methods to predict protein modification sites. In the study of protein modification site data, the classification of unbalanced data sets is a current research hotspot, yet traditional machine learning methods are generally suited only to balanced data sets. This project extracts features from the sequence information and structural characteristics of the protein itself and then applies algorithms designed for unbalanced data to achieve better prediction results.

S-sulfenylation is the reversible oxidation of protein cysteine residues, and it plays a key role in pathology and physiology. In this study, conditional probabilities based on sequence coupling are used to formulate the samples. Using the feature information from different positions, an ensemble of support vector machines is combined into an overall classifier by majority voting to improve prediction performance, yielding a predictor named iSulf-Wide-PseAAC for protein S-sulfenylation sites. After training with five-fold cross-validation, the predictor achieves Sn, Sp, MCC, and Acc of 88.28%, 92.16%, 79.95%, and 90.80%, respectively, on the independent test set, a clear improvement over the existing predictor.

Malonylation, a newly discovered PTM, participates in the regulation of human metabolism through chemical modification of the positively charged Lys side chain, and it is closely associated with changes in protein function and structure. This project extracts sequence features by combining the coupling information of the protein sequence with the components of the general pseudo amino acid composition (PseAAC), and uses a variety of ensemble learning methods to train classifiers on the unbalanced data set. With models trained under cross-validation, the results on the independent test set show that Sn and MCC are better than those of existing predictors. Because Sn and MCC are the important measures for unbalanced data, the overall improvement is still clear.

The research in this thesis addresses a topic of current interest and focuses mainly on the binary classification of unbalanced data sets. In existing protein modification site data, unmodified proteins normally far outnumber modified ones, so solving the unbalanced classification problem merits further research and exploration.
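
The majority-voting step and the four reported measures (Sn, Sp, Acc, MCC) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `majority_vote` and `evaluate`, the tie-breaking rule, and the toy labels are all assumptions made for illustration, and the base classifiers (support vector machines in the thesis) are replaced here by hard-coded 0/1 predictions. The sketch returns MCC as a fraction in [-1, 1], whereas the abstract quotes it as a percentage.

```python
import math


def majority_vote(votes_per_classifier):
    """Combine 0/1 predictions from several base classifiers.

    votes_per_classifier: list of equal-length lists, one list per classifier.
    A site is called positive only if more than half of the classifiers say so
    (ties count as negative; this tie rule is an assumption of the sketch).
    """
    n_classifiers = len(votes_per_classifier)
    combined = []
    for site_votes in zip(*votes_per_classifier):
        positives = sum(site_votes)
        combined.append(1 if 2 * positives > n_classifiers else 0)
    return combined


def evaluate(y_true, y_pred):
    """Return sensitivity, specificity, accuracy and MCC for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on modified sites)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (recall on unmodified sites)
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Toy, unbalanced example: 2 modified sites among 8 (made-up labels).
    y_true = [1, 1, 0, 0, 0, 0, 0, 0]
    base_predictions = [
        [1, 0, 0, 0, 1, 0, 0, 0],  # hypothetical classifier A
        [1, 1, 0, 0, 0, 0, 0, 0],  # hypothetical classifier B
        [1, 0, 0, 1, 1, 0, 0, 0],  # hypothetical classifier C
    ]
    y_pred = majority_vote(base_predictions)
    print(y_pred)
    print(evaluate(y_true, y_pred))
```

On this toy data the accuracy is 0.75 even though only one of the two modified sites is found (Sn = 0.5), which shows why Sn and MCC are more informative than Acc when unmodified sites dominate.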
Keywords/Search Tags: unbalanced model, pseudo amino acid, feature extraction, post-translational modification, ensemble learning