Computational Prediction And Analysis Of Lysine Post-translational Modification Sites Based On Machine Learning Algorithm

Posted on: 2020-03-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Ning
GTID: 1360330620952351
Subject: Intelligent Environment Analysis and Planning
Abstract/Summary:
Proteins are organic macromolecules that form the material basis of life; they are the basic organic matter that makes up cells and carries out the activities of life. During translation, amino acids are linked into peptide chains, which coil and fold to form precursor proteins. Precursor proteins are inactive, however, and acquire their biological functions through a series of post-translational processing steps. This chemical modification is called protein post-translational modification. It takes many forms, such as the addition of various functional groups, chemical bonds, or other peptide chains to a protein, and it plays an important regulatory role in cell functions and biological processes. Some studies have shown that abnormalities and variations at post-translational modification sites are closely related to diseases, including cancers. Predicting and analyzing post-translational modification sites, and understanding the underlying biological processes and mechanisms, are therefore important topics in proteomics research. Compared with traditional experimental methods, computational prediction methods are widely used for identifying post-translational modifications because they are convenient and fast.

Lysine, one of the 20 common amino acids, is an α-amino acid encoded by the codons AAA and AAG; its chemical formula is HO2CCH(NH2)(CH2)4NH2. Because of its special molecular structure, lysine is easily modified after translation, and many types of post-translational modification occur on it. For three types of modification that can occur on lysine (succinylation, formylation, and glutarylation), this dissertation proposes new computational prediction methods based on machine learning, which effectively improve the prediction accuracy for these three kinds of modification sites. Proteomic analysis of protein data for the three modifications is also carried out to explore their potential functions and characteristics. The main research contents are as follows:

(1) Two new methods for predicting succinylation sites are proposed: PSuccE, which is based on ensemble learning with bilevel support vector machine classifiers, and SSKMSuc, which combines a semi-supervised learning method with the K-means clustering algorithm. PSuccE uses a bootstrap sampling strategy to draw different negative sample subsets and combines each of them with the positive sample set to form multiple sample subsets. Within each subset, a two-step feature selection method based on information gain selects the optimal feature subset for modeling from the whole feature space, which fuses multiple sequence feature encoding methods. A new support vector machine classifier is then trained as the final predictor, taking the outputs of all the prediction models as new features. Comparison with other predictors on independent test sets shows that PSuccE's prediction accuracy is clearly better than that of existing methods. Analysis of the features and method steps shows that the features adopted in this study effectively reflect the characteristics of succinylation sites, and that ensemble learning with bilevel support vector machine classifiers improves predictive performance in all aspects.

SSKMSuc builds a new prediction tool for lysine succinylation by fusing information on adjacent post-translational modifications with multiple sequence features. The K-means clustering algorithm divides the data set into five clusters. For each cluster, a two-step feature selection strategy based on random forest removes redundant features and yields the optimal feature subset. Also within each cluster, a new semi-supervised learning method uses the positive sample information to select reliable negative samples, equal in number to the positive samples, from the non-succinylated samples. Finally, a support vector machine is used as the classifier to build the model. Analysis of the adjacent post-translational modification information shows that succinylation, acetylation, and ubiquitination may depend on similar reaction environments, and that succinylation at the +7 and -4 positions may affect the formation of succinylation at the central lysine site. KEGG analysis of succinylated proteins further confirms that protein succinylation has a potential impact on amino acid degradation and fatty acid metabolism, and the analysis suggests that protein succinylation may be closely related to the occurrence of neurodegenerative diseases such as Huntington's disease, Parkinson's disease, and Alzheimer's disease.

(2) A prediction method named dForml(KNN)-PseAAC, based on semi-supervised learning and the K-nearest neighbor algorithm, is proposed for predicting protein formylation sites. Lysine formylation is an important post-translational modification, but because the databases currently record only a small amount of data, no prediction method for lysine formylation had been established. Guided by information entropy, discrete windows rather than traditional continuous windows are used to extract protein sequence segments. Three sequence feature encoding methods extract feature information around formylation and non-formylation sites. A semi-supervised learning method selects more reliable non-formylated samples as negative samples for modeling, which addresses the serious imbalance between positive and negative samples while preserving the performance of the prediction model. Comparative analysis of the prediction results shows that the K-nearest neighbor algorithm is the most suitable of the classifiers compared and can effectively predict formylation sites in proteins. Gene Ontology analysis of formylated proteins suggests that protein formylation may be correlated with protein synthesis.

(3) A new method, DEXGBGlu, is proposed for predicting protein glutarylation sites. DEXGBGlu uses XGBoost (eXtreme Gradient Boosting) as the classifier, with the XGBoost parameters optimized by a differential evolution algorithm. Using multiple sequence features, the method effectively distinguishes glutarylation sites from non-glutarylation sites at lysine residues in proteins. To address the imbalance between positive and negative samples, Borderline-SMOTE (Borderline Synthetic Minority Oversampling Technique) synthesizes positive samples until their number equals that of the negative samples. The Tomek links method is then applied to the combined training set to remove data that may be noise. Analysis of the methods and prediction results shows that differential evolution improves the classification performance of XGBoost, and that combining Borderline-SMOTE with Tomek links not only resolves the imbalance between positive and negative samples but also improves the prediction accuracy of DEXGBGlu, making it significantly superior to other existing protein glutarylation prediction tools.
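
The Tomek links cleaning step mentioned for DEXGBGlu can be illustrated with a minimal pure-Python sketch. It is not the dissertation's implementation: the function names `tomek_links` and `remove_tomek_links`, the Euclidean distance, the toy data, and the choice to drop both members of each link are all assumptions made for illustration. A Tomek link is a pair of samples with different labels that are each other's nearest neighbor; such pairs sit on the class boundary and are candidates for noise.

```python
import math


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links.

    Two samples form a Tomek link when they carry different labels
    and each one is the other's nearest neighbour.
    """
    def nearest(i):
        # Index of the closest other sample (Euclidean distance, assumed).
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link (one possible cleaning policy)."""
    drop = {k for pair in tomek_links(X, y) for k in pair}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy 2-D feature vectors (made-up values): 1 = modified site, 0 = unmodified.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.55, 0.5), (0.6, 0.5)]
y = [1, 1, 0, 0, 1, 0]

links = tomek_links(X, y)
X_clean, y_clean = remove_tomek_links(X, y)
print(links)    # [(4, 5)]
print(y_clean)  # [1, 1, 0, 0]
```

In the toy data, samples 4 and 5 are mutual nearest neighbors with opposite labels, so they are the only pair removed. Another common policy removes only the majority-class member of each link; the abstract does not say which one DEXGBGlu uses.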
Keywords/Search Tags:Lysine, post-translational modification, machine learning, semi-supervised learning method, oversampling method