Research On Intrinsically Disordered Protein Prediction Based On Sequence Information

Posted on: 2021-02-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y M Liu
GTID: 1480306569485234
Subject: Computer application technology
Abstract/Summary:
Proteins are essential components of living cells and carry out most of the activities of life. With the development of biological sequencing technology, the amount of protein sequence data has grown rapidly. Protein structure data is far scarcer than sequence data, and this scarcity has become the main bottleneck in protein function research. In proteomics, researchers have long followed the "sequence-structure-function" paradigm. However, as the field advanced, researchers found that intrinsically disordered proteins (IDPs), which lack a stable three-dimensional structure, also perform important biological functions. Because biological experiments are time-consuming and costly, fast and efficient computational methods for identifying IDPs are increasingly urgent. This dissertation proposes several IDP prediction methods based on protein sequence information. The main contents are as follows.

Most existing prediction methods are based on a sliding window, in which the protein subsequence inside the window is treated as the sample for the target amino acid residue. As a result, window-based methods fail to capture the structural dependence between adjacent residues. To solve this problem, IDP-CRF is proposed, a predictor built on conditional random fields. Four kinds of protein information are extracted as the state features of amino acid residues: evolutionary information, amino acid composition, protein secondary structure, and relative solvent accessibility. Transition features are then integrated to construct the final predictor, IDP-CRF. Experimental results show that IDP-CRF achieves performance comparable to the deep learning method SPOT-disorder and outperforms the other single-model methods in the comparison. It is concluded that the structural dependence between adjacent amino acid residues is of great significance for the prediction of intrinsically disordered proteins.

Most existing prediction methods do not effectively account for the differences between proteins whose disordered regions differ in length. To solve this problem, an IDP prediction method called IDP-FSP is proposed, which fuses three specialized predictors. Proteins are divided into three categories according to the type of disordered region they contain: proteins containing long disordered regions, proteins containing only short disordered regions, and proteins containing general disordered regions. Three prediction models, one per category, are built on conditional random fields, and a logistic regression model integrates them into the final method, IDP-FSP. Experimental results show that IDP-FSP outperforms IDP-CRF, which indicates that modeling proteins with disordered regions of different lengths independently can effectively improve IDP prediction performance.

Most existing IDP datasets do not reflect the distribution of proteins in the real world, which leads to a high false positive rate in prediction results. To solve this problem, a predictor called RFPR-IDP is proposed, based on a convolutional recurrent neural network, which reduces the false positive rate of IDP prediction. In the real world, proteins comprise both IDPs and fully ordered proteins. Therefore, fully ordered proteins are first extracted from the protein database under strict conditions. Then, using datasets with different ratios of IDPs to fully ordered proteins, a convolutional recurrent neural network model is constructed to learn local sequence patterns and long-distance dependencies among amino acid residues. Finally, the role of fully ordered proteins is analyzed. Experimental results show that RFPR-IDP effectively reduces the false positive rate of the model. Compared with the other methods, RFPR-IDP achieves the best performance on most datasets that combine IDPs and fully ordered proteins in different ratios.

Different prediction methods differ in their ability to characterize protein features, and no single method makes full use of protein sequence information. To solve this problem, a meta method for IDP prediction called IDP-Meta is proposed. First, IDP-MSA is proposed, which combines a Long Short-Term Memory network with a gap feature extracted from multiple sequence alignment information. Alongside IDP-MSA, IDP-FSP and RFPR-IDP are selected as base methods, because the three use different protein information and different machine learning methods. IDP-Meta is built on these three methods and captures their respective advantages and the complementary information among them. Experimental results show that IDP-Meta achieves the best performance among the compared methods.
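
As a rough illustration of two sequence-level ideas named in the abstract, the sketch below builds one sliding-window sample per residue and computes a per-column gap fraction from a multiple sequence alignment. It is not the dissertation's implementation: the abstract does not give the window size, the padding scheme, or the exact definition of the gap feature, so the function names `sliding_windows` and `gap_fractions`, the padding symbol `PAD`, and the "fraction of gapped rows per column" definition are all assumptions made for this example.

```python
from typing import List

# Hypothetical padding symbol for window positions beyond the sequence ends.
PAD = "X"


def sliding_windows(sequence: str, window_size: int = 5) -> List[str]:
    """Return one fixed-length window per residue, centred on that residue.

    Each window is the kind of per-residue sample a window-based predictor
    would classify independently of its neighbours' labels.
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd integer")
    half = window_size // 2
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + window_size] for i in range(len(sequence))]


def gap_fractions(alignment: List[str]) -> List[float]:
    """Return, for each alignment column, the fraction of rows with a gap ('-').

    Assumed definition of a gap feature; the source does not specify one.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    n_rows = len(alignment)
    return [
        sum(row[col] == "-" for row in alignment) / n_rows
        for col in range(length)
    ]


if __name__ == "__main__":
    # Toy sequence and toy alignment, invented purely for illustration.
    print(sliding_windows("MKVL", window_size=3))
    print(gap_fractions(["MK-L", "M--L", "MKVL"]))
```

Each window is scored on its own, which is why the abstract argues that window-based methods miss the dependence between the labels of adjacent residues; a conditional random field adds transition features over neighbouring labels to model that dependence.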
Keywords/Search Tags: Intrinsically disordered protein, Structural dependency, Length of disordered region, Fully ordered protein information, Gap feature