Predicting Protein Disorder Regions With Hybrid Sequence Complexity

Posted on: 2021-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: S P He
GTID: 2370330611983353
Subject: Bioinformatics
Abstract/Summary:
Organisms rely on proteins to perform their biological functions, and protein science has long been a core area of biological research. Traditional protein research follows the "sequence-structure-function" paradigm: the amino acid sequence of a protein determines its three-dimensional structure, and the three-dimensional structure determines its biological function. Since the 1990s, however, researchers have found that certain proteins lack a stable three-dimensional structure yet still take part in specific biological processes. As more such proteins were discovered, they came to be recognized as a distinct class, the Intrinsically Disordered Proteins (IDPs). An IDP contains, in whole or in part, segments of amino acid residues that cannot form a stable three-dimensional structure; these segments are called Intrinsically Disordered Regions (IDRs). Over the past two decades, IDRs have been reported to play key roles in many biological processes, including cell signal transduction, protein phosphorylation, chromatin structure remodeling, and the formation of super enhancers (SE). More importantly, research in the past two years has shown that proteins can form droplet condensates through their IDRs and undergo liquid-liquid phase separation, a process reported to be closely associated with certain neurodegenerative diseases. For example, the IDRs of the FUS and hnRNPA1 proteins participate in droplet formation in amyotrophic lateral sclerosis; as the droplets become more viscous they eventually form fibrous solids, leading to disease. IDRs have therefore become a focus of current frontier research in biology. Progress in this field has scientific value in its own right and also has potential applications in analyzing the mechanisms of complex human diseases.

Current methods for identifying protein IDRs fall into two categories: experimental and computational. Experimental methods rely on established physical or chemical techniques, including X-ray crystallography, nuclear magnetic resonance, and protease hydrolysis experiments. When experimental conditions are not available, an accurate computational method is a good alternative. Over the past two decades, researchers have developed dozens of computational methods for identifying protein IDRs, such as IUPred, DISOPRED3, PrDOS, and POODLE. In this thesis, a hybrid sequence complexity algorithm is used to characterize the sequence features of IDRs, and phosphorylation and hydrophilicity are used to characterize their physicochemical features. First, because IDRs contain many repetitive amino acid fragments, they show markedly low sequence complexity, which inspired us to use the mathematical concepts of factor complexity and Abelian complexity to describe the complexity of amino acid sequences. Second, given the close relationship between phase separation and both phosphorylation and hydrophilicity reported in the literature, we add phosphorylation site information and the hydropathic index of each sequence position to the sequence features, so that the physicochemical properties of IDRs are reflected and accuracy is further improved.

Experiments show that a characterization based on hybrid sequence complexity and physicochemical features predicts IDRs well. We first used Uniprot90 as the training data set and hybrid complexity as the feature algorithm, with Random Forest (RF), Support Vector Machine (SVM), K Nearest Neighbors (KNN), and Naive Bayes (NB) as candidate classifiers, and used 5-fold cross-validation to select the model and its parameters and to make a preliminary evaluation. The best classifier was RF, with a sliding window of 4, 210 trees in the forest, and a maximum of 2 features per tree. Under 5-fold cross-validation, the accuracy (ACC) was 0.875, the Matthews correlation coefficient (MCC) was 0.745, and the area under the ROC curve (AUC) was 0.931. We then used two gold-standard data sets, CASP9 and CASP10, as independent test sets to evaluate the model further. All three indicators were lower on the independent test sets: on CASP9 and CASP10 respectively, ACC was 0.788 and 0.780, MCC was 0.601 and 0.582, and AUC was 0.835 and 0.857. Next, we added phosphorylation site information and hydropathic index information, retrained the model, and evaluated it on the same CASP9 and CASP10 test sets. All indicators improved; in particular, AUC rose from 0.835 and 0.857 to 0.878 and 0.902.

We compared our method with existing computational methods, including IUPred (long), IUPred (short), SPINE-D, DISOPRED3, DeepCNF-D, and DeepCNF-D (ami-only). The method based on hybrid sequence complexity alone achieved the best MCC (0.601) and ranked second on AUC (0.835, against a best result of 0.855). The improved method with phosphorylation site and hydropathic index information achieved the best performance on all three indicators, ACC, MCC, and AUC. This shows that the hybrid sequence complexity algorithm combined with phosphorylation site and hydropathic index information characterizes IDRs more effectively and compares favorably with existing methods. As an application of the model, we carried out a predictive study of the SE formation mechanism in Triple Negative Breast Cancer (TNBC) and put forward a conjecture based on the results. In conclusion, the IDR characterization method based on hybrid sequence complexity together with phosphorylation and hydrophilicity information is intuitively reasonable and predictively valid, and it compares favorably with existing computational methods. We hope that this method can become a useful computational tool for IDR prediction and support research on disease.
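
As an illustration of the two complexity measures named in the abstract, the sketch below uses their standard definitions from combinatorics on words: the factor complexity of a sequence at length n is the number of distinct length-n substrings, and the Abelian complexity is the number of distinct length-n substrings when substrings that are rearrangements of one another are counted once. The abstract does not say how the thesis combines the two into its "hybrid" feature, nor exactly what the sliding window of 4 refers to, so the function names and the parameters `n` and `flank` here are assumptions chosen for illustration only.

```python
def factor_complexity(seq: str, n: int) -> int:
    """Count the distinct length-n substrings (factors) of seq."""
    if n <= 0 or n > len(seq):
        return 0
    return len({seq[i:i + n] for i in range(len(seq) - n + 1)})


def abelian_complexity(seq: str, n: int) -> int:
    """Count length-n factors of seq, treating rearrangements as identical."""
    if n <= 0 or n > len(seq):
        return 0
    # Sorting the letters of a factor gives a canonical form of its composition.
    return len({"".join(sorted(seq[i:i + n])) for i in range(len(seq) - n + 1)})


def complexity_profile(seq: str, n: int = 2, flank: int = 5):
    """Per-residue (factor, Abelian) complexity of a local neighbourhood.

    Hypothetical feature layout: for residue i, both measures are computed
    on seq[i - flank : i + flank + 1], truncated at the sequence ends.
    """
    profile = []
    for i in range(len(seq)):
        region = seq[max(0, i - flank): i + flank + 1]
        profile.append((factor_complexity(region, n),
                        abelian_complexity(region, n)))
    return profile
```

Under these definitions a repetitive segment such as "QQQQQQ" yields one distinct factor at every length, while a diverse segment of the same size yields several, which is the low-complexity signal the abstract attributes to IDRs.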
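
The three indicators reported in the abstract (ACC, MCC, AUC) can be computed from per-residue predictions as sketched below. This is a minimal pure-Python version of the standard formulas, not the evaluation code used in the thesis; the function names are illustrative, and labels are assumed to be 1 for disordered and 0 for ordered residues.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of residues classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auc(labels, scores) -> float:
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive residue
    receives a higher score than a randomly chosen negative one,
    with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

MCC is informative for disorder prediction because ordered residues usually outnumber disordered ones, and accuracy alone can look high for a classifier that rarely predicts disorder.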
Keywords/Search Tags: protein disordered region, liquid-liquid phase separation, factor complexity, Abelian complexity, protein phosphorylation, Random Forest