
Research on Protein Post-translational Modification Site Prediction Using Deep Learning

Posted on: 2019-05-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D L Wang
Full Text: PDF
GTID: 1360330548956759
Subject: Computer Science and Technology
Abstract/Summary:
Post-translational modification (PTM) generally refers to the covalent attachment of a functional group or a small protein to a specific position in a protein's amino acid sequence. Over 400 types of PTMs have been identified, such as phosphorylation, acetylation, methylation, and ubiquitination. PTMs influence almost all aspects of cell biology and pathogenesis. They play key roles in many cellular processes, such as cellular differentiation, signaling and regulatory processes, regulation of gene expression, and protein-protein interactions. Aberrant PTMs are strongly associated with diseases, including cancers, and a variety of regulatory enzymes involved in PTMs have become drug targets. The study of PTMs is therefore important in proteomics.

A prerequisite for studying PTMs is the thorough and robust identification of PTM sites. Although specific PTM sites can be detected by high-resolution mass spectrometry, such experimental methods are expensive and time-consuming, so they are unsuitable for large-scale detection and cannot feasibly be applied at the whole-proteome level. Compared with conventional experimental methods, the computational identification of PTM sites by bioinformatics methods provides an alternative strategy that is fast and inexpensive. The existing bioinformatics methods for computational identification of PTM sites can be categorized into two groups: statistics-based analysis methods and machine-learning-based prediction methods. Compared with statistics-based methods, machine-learning methods can capture more complex sequence features, so their prediction performance is much better.

Although a series of machine-learning methods for PTM site prediction have been developed, the prediction accuracy of the existing methods has significant room for improvement. Most existing methods rely on feature extraction guided by researcher intervention, which may result in incomplete or biased features; the problem is worse when a PTM type has no known related features. As a cutting-edge machine-learning approach, deep learning is very good at discovering intricate structures in raw high-dimensional data and is therefore applicable to many domains of science, business, and government. In recent years there has been growing interest in applying deep-learning methods to the analysis of biological sequences.

In this dissertation, we present two prediction models based on deep learning, together with multiple training strategies, for PTM site prediction. We also explore the application of other deep-learning frameworks that were successfully applied to other biological sequence analyses. The main research content of this dissertation is as follows:

(1) For general phosphorylation site prediction, we proposed a novel deep-learning framework, called MusiteDeep, based on a two-directional attention mechanism over both the sequence dimension and the feature-map dimension. MusiteDeep takes raw protein sequences as input and automatically abstracts sequence features through multi-layer convolutional neural networks, avoiding feature selection with human intervention. The proposed two-directional attention mechanism generates phosphorylation-related protein sequence representations from both the sequence dimension and the feature-map dimension, and visualization shows that these representations are biologically interpretable. The comparison of MusiteDeep with other tools on the benchmark dataset shows the significant advantages of MusiteDeep in general phosphorylation site prediction; in particular, it achieves a relative improvement of over 50% in the area under the precision-recall curve (PRC).

(2) We also implemented, for the general phosphorylation site prediction problem, several deep-learning architectures that had been applied to other biological sequence analyses. These include the one-layer convolutional neural network architecture of DeepBind, which was used to predict the sequence specificities of DNA- and RNA-binding proteins; the hybrid convolutional neural network and long short-term memory architecture of DanQ, which was used to predict the properties and functions of DNA sequences; and a multi-layer recurrent neural network architecture for protein function prediction. These architectures were compared with the MusiteDeep framework on the benchmark dataset.

(3) To solve the small-sample training problem of deep learning in kinase-specific site prediction, transfer learning was applied to the MusiteDeep framework, making use of the hierarchical relationship between general and kinase-specific phosphorylation. On the benchmark dataset, which contains only a few hundred training samples, the MusiteDeep model trained by transfer learning achieves sensitivity comparable to that of other tools and achieves better precision in most cases. The precision-recall curves, especially for the kinases CDK, PKA, and CK2, are significantly superior to those of other existing methods. Finally, the transfer-learning method was integrated into the MusiteDeep framework and the MusiteDeep toolkit was developed. The MusiteDeep toolkit provides predictions for general and kinase-specific phosphorylation sites, and also supports custom training of prediction models for other PTM sites.

(4) To address unbalanced training data in PTM site prediction, a bootstrapping-based ensemble method that integrates multiple deep-learning models was proposed. This dissertation verifies that models trained with the bootstrapping method achieve improved prediction performance in general phosphorylation site prediction. The method can easily be applied to other PTM site prediction tasks, and its advantage in robustness is expected to grow as the data become more unbalanced.

(5) To further improve the accuracy of PTM site prediction, training-data serialization and training-model parallelization strategies were proposed. In the training-data serialization strategy, a hybrid-species model was introduced, which improved the prediction accuracy of general phosphorylation site prediction for species with fewer annotated samples. In the training-model parallelization strategy, a parallelized model was introduced so that training data for different amino acid residues can be trained simultaneously. This enhanced the learning of general phosphorylation features while refining amino acid-specific features, and improved phosphorylation site prediction performance for threonine and tyrosine.

(6) For the first time, the capsule network (CapsNet) was applied to the study of biological sequences, and its feasibility was explored in PTM site prediction problems, including phosphorylation, N-linked glycosylation, N6-acetyllysine, methyl-arginine, S-palmitoyl-cysteine, pyrrolidone carboxylic acid, and SUMOylation. The experiments showed that in most cases the prediction results of the proposed CapsNet architecture for PTM site prediction were superior to those of other machine-learning tools, especially when learning from small training sets. CapsNet not only performs well in prediction accuracy but is also able to explore internal data distributions related to biological significance. For example, without any kinase annotation, the internal capsules can learn features related to kinase families and generate sequence logos that are consistent with the real kinase families; the protein sequence representations generated from high-level capsules had stronger discriminative power in distinguishing substrates of different kinase families than other representations. We also demonstrated on real data that the length of the output capsule can be used as an estimate of the prediction reliability for a particular PTM.

In summary, this dissertation is a study of deep learning for protein post-translational modification site prediction. Two main deep-learning architectures were presented: the MusiteDeep model and the CapsNet model. The MusiteDeep model has an advantage in PTM site prediction with large amounts of annotated data, and the CapsNet model is more suitable for PTM site prediction with small amounts of annotated data. To the best of our knowledge, the MusiteDeep framework proposed here is the first deep-learning framework for phosphorylation site prediction, and the CapsNet-based framework is the first application of CapsNet to biological sequence analysis. This dissertation also investigated the biological interpretability of deep-learning models and various deep-learning training strategies. Together these provide a set of procedures for applying deep learning to PTM site prediction problems and examples of applying deep learning to protein sequences, which may inspire the use of deep learning in other bioinformatics applications.
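
As an illustration of the bootstrapping-based ensemble idea in item (4), the following minimal Python sketch shows one common way such a scheme can be set up. The abstract does not give the exact procedure, so everything here is an assumption: each ensemble member is trained on all positive samples plus an equally sized random draw from the much larger negative set, and the members' scores are averaged. The function names (balanced_bootstrap_sets, train_centroid_model, ensemble_predict) are hypothetical, and the toy one-dimensional "model" merely stands in for a deep network.

```python
import random
from statistics import mean


def balanced_bootstrap_sets(positives, negatives, n_models, seed=0):
    """Yield n_models balanced training sets (assumed scheme, not the
    author's confirmed procedure): all positives plus an equally sized
    draw, with replacement, from the negatives."""
    rng = random.Random(seed)
    for _ in range(n_models):
        sampled_negatives = rng.choices(negatives, k=len(positives))
        yield positives, sampled_negatives


def train_centroid_model(pos, neg):
    """Toy stand-in for a deep network on 1-D features: the score rises
    toward 1 as x moves closer to the positive mean than the negative mean."""
    mu_pos, mu_neg = mean(pos), mean(neg)

    def predict(x):
        d_pos, d_neg = abs(x - mu_pos), abs(x - mu_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return predict


def ensemble_predict(models, x):
    """Average the scores of all ensemble members."""
    return mean(model(x) for model in models)


if __name__ == "__main__":
    # Hypothetical unbalanced data: 3 positives versus 30 negatives.
    positives = [0.9, 0.8, 1.0]
    negatives = [i / 100 for i in range(30)]
    models = [
        train_centroid_model(p, n)
        for p, n in balanced_bootstrap_sets(positives, negatives, n_models=5)
    ]
    print(ensemble_predict(models, 0.95))  # close to the positives
    print(ensemble_predict(models, 0.10))  # close to the negatives
```

Because every member sees a different sample of the negatives, the ensemble as a whole uses more of the negative data than any single balanced model, while no individual model is trained on a skewed class ratio.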
Keywords/Search Tags: bioinformatics, machine learning, deep learning, protein post-translational modification site prediction