
Research On Machine Learning Based Protein Post-Translational Modification Site Predictions

Posted on: 2021-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: J G Chen
GTID: 2370330602982170
Subject: Control engineering

Abstract/Summary:
Post-translational modification (PTM) site prediction is an important research topic in bioinformatics. More than 400 PTM reactions are known, including glycation and citrullination. PTMs play a key role in cellular processes such as gene expression regulation, signal transduction, and protein-protein interaction. Determining whether a given amino acid residue in a protein undergoes a PTM, and investigating the reaction mechanism in depth, plays a decisive role in understanding the pathogenesis of related diseases and in broadening treatment strategies. Identifying PTM sites experimentally is expensive, time-consuming, and labor-intensive, and cannot be carried out on a large scale, so fast and accurate computational methods are needed. Using machine learning, this thesis builds computational predictors for glycation and citrullination sites. The main research contents are as follows:

(1) Based on support vector machines (SVM), a new protein glycation site predictor, Gly-predict, is proposed. The k-nearest neighbor algorithm is used to remove redundant negative samples from the dataset; protein sequence features are extracted through binary encoding (BE), accessible surface area, secondary structure probability, and gray correlation; and the maximum relevance minimum redundancy (mRMR) feature selection algorithm is used to obtain the optimal feature set. On an independent test set, the method outperforms existing methods. However, because of class imbalance in the dataset, the sensitivity of Gly-predict is unsatisfactory and leaves room for improvement.

(2) To deal with the imbalanced dataset, a recurrent neural network (RNN) with long short-term memory (LSTM) units is proposed for generating glycation-site peptide chains. With binary-encoded positive samples as input, the optimal hyperparameters of the network are obtained by grid search, and peptide chains are generated at an appropriate sampling temperature. The effectiveness of the method is verified in terms of global charge, hydrophobicity, and amino acid composition. Statistical significance tests show that the generated peptide chains are similar to the original positive samples.

(3) In view of the similarity between biological language and natural language, the peptide chains in the dataset are cut into biological words and encoded with a continuous distributed representation. With the encoded peptide chain vectors as input, DeepGly, a convolutional neural network (CNN)-based protein glycation site predictor, is constructed. Comparison with existing methods confirms that DeepGly has strong predictive ability.

(4) A predictor, PCSPred_SC, is proposed for predicting protein citrullination sites. The predictor first extracts sequence features using BE, position-specific amino acid propensity (PSAAP), pseudo-amino acid composition (PseAAC), and physicochemical properties (PP). The predictive capabilities of different prediction algorithms, oversampling methods, and feature selection methods are then compared over the complete feature space. Experimental results show that predictors using an oversampling algorithm have higher sensitivity, and that adding feature selection reduces the feature dimension and the computational overhead. Compared with other methods on the same dataset, a predictor combining SVM, adaptive synthetic sampling (ADASYN), and t-distributed stochastic neighbor embedding (t-SNE) performs better.
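As an illustration of the binary encoding (BE) feature mentioned above, the following is a minimal sketch and not the thesis's implementation. It assumes the common convention of a 20-dimensional one-hot vector per residue, that candidate glycation sites are lysine (K) residues, and that windows near the sequence ends are padded with a placeholder symbol. The window half-width `flank`, the padding symbol `X`, and the function names are hypothetical choices made for this example.

```python
# Sketch of binary (one-hot) encoding for peptide windows.
# Assumptions (not taken from the thesis): 20 standard amino acids in
# alphabetical one-letter order, lysine-centred windows, 'X' as padding.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # hypothetical padding symbol, encoded as an all-zero vector


def binary_encode(peptide):
    """Return a flat 0/1 list with 20 entries per residue."""
    vector = []
    for residue in peptide:
        one_hot = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:  # unknown or padding symbols stay all-zero
            one_hot[index] = 1
        vector.extend(one_hot)
    return vector


def lysine_windows(sequence, flank=7):
    """Return (1-based position, window) pairs centred on each lysine."""
    padded = PAD * flank + sequence + PAD * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # In padded coordinates the residue sits at i + flank,
            # so the window spans padded[i : i + 2 * flank + 1].
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


if __name__ == "__main__":
    for position, window in lysine_windows("MKTAYIAKQR", flank=3):
        features = binary_encode(window)
        print(position, window, len(features))
```

Each window of length `2 * flank + 1` yields a feature vector of `20 * (2 * flank + 1)` entries, which could then be passed to a classifier such as an SVM.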
Keywords/Search Tags: Protein post-translational modifications (PTMs), Glycation, Citrulline, Machine learning, Neural network