Research On Key Techniques For Protein Residue Contact And Distance Prediction

Posted on: 2022-12-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H L Zhang
GTID: 1480306773970889
Subject: Automation Technology
Abstract/Summary:
Protein structure prediction helps researchers understand protein function at the atomic level at low cost and with high efficiency, and accurate residue contact and distance prediction is of great significance for sequence-based protein structure prediction. Applications of residue contact and distance prediction have been extended to the identification of disordered regions in protein structures, the segmentation of protein domains, the acceleration of molecular dynamics simulations, the prediction of protein-protein interactions, and protein design. Residue contact prediction methods have evolved from local correlation analysis through direct coupling analysis to machine learning, and have developed rapidly with the introduction of deep learning. However, current research on residue contact/distance prediction still has limitations. First, the field lacks a large-scale benchmark and a comprehensive performance evaluation of existing methods. Second, the predictions of different contact prediction methods differ from and complement one another, yet there is still no strategy for integrating them that is both efficient and accurate. Finally, although deep-learning-based residue contact/distance prediction is progressing rapidly, its performance in complex scenarios still needs to improve. To this end, this dissertation studies several basic problems that urgently need to be solved in the field of residue contact and distance prediction, and carries out the following three studies:

1. A large-scale benchmark and comprehensive performance evaluation of existing residue contact/distance prediction methods is carried out. The study first constructs a benchmark protein dataset that is both diverse in protein type and highly non-redundant. Based on this dataset, it conducts a retrospective analysis of traditional machine learning, evolutionary coupling analysis and consensus machine learning methods, as well as a multi-perspective study of the rapidly evolving deep learning methods. The main findings are: the native contact density of residues and the quality of the multiple sequence alignment are, respectively, the key intrinsic and extrinsic factors affecting contact/distance prediction; different types of residue contact/distance methods suit different application scenarios; and the predictions of different methods differ from and complement one another. Although deep-learning-based methods lead in overall prediction performance, there is still much room for improvement: (1) performance degrades substantially with shallow multiple sequence alignments; (2) current methods show lower precision for inter-domain than for intra-domain contact prediction, and intra-domain precision is highly imbalanced from one domain to another; (3) the predictions of different deep learning methods are strongly similar, which indicates that more feature types and more diversified models need to be developed. The main contribution of this part of the research is that, through large-scale benchmark evaluation, it identifies the key factors affecting contact/distance prediction performance, discusses the scenarios to which each method is best suited, and explores promising directions for further improvement. This research provides guidance for the development of future contact/distance prediction methods.

2. A consensus method based on mixed-integer linear programming, named COMTOP, is proposed for protein residue contact prediction. The large-scale performance evaluation shows that the predictions of contact prediction methods based on machine learning, evolutionary coupling analysis and deep learning are partly similar and partly different. In response to these characteristics, COMTOP uses mixed-integer linear programming to integrate seven sub-methods of different types, and exploits the differences and complementarities of existing methods when building models and searching for optimal parameters, in order to further improve the precision of residue contact prediction. COMTOP not only overcomes the low prediction accuracy of traditional machine learning and evolutionary coupling analysis methods, but also effectively avoids the overfitting of deep learning and traditional machine learning methods. COMTOP is more robust than the seven individual methods in contact prediction for different types of proteins, and its improvement over the individual methods becomes more pronounced as the number of predicted contacts increases. The proposed method is evaluated on four independent test sets. The experimental results show that, compared with the best-performing sub-method, COMTOP improves prediction precision by 13.6% on average and by up to 27.1%.

3. A residue distance prediction method based on a deep residual network, named DuetDis, is proposed for protein residue distance prediction. DuetDis introduces squeeze-and-excitation and dilated convolution into the deep residual network, integrates different types of feature sets, uses a metagenomic database to construct the multiple sequence alignments of the training set, includes multi-domain proteins in the training set and strengthens training on inter-domain regions, and combines submodels obtained with different training strategies and different feature sets. DuetDis depends less on the number of effective sequences in the multiple sequence alignment and is robust in predicting inter-domain distances for multi-domain proteins. The experimental results show that: (1) when the multiple sequence alignment contains only one effective sequence, the prediction accuracy of DuetDis reaches 60%, which is 6.7% and 10.5% higher than the peer methods RaptorX and trRosetta, respectively; (2) when only one-sixth of the multi-domain protein training data is used to reinforce training on inter-domain regions, the precision of DuetDis for inter-domain distance prediction is 7.3% and 9.4% higher than that of RaptorX and trRosetta, respectively.
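
The precision figures quoted above depend on how a "contact" is defined and on how many top-ranked predictions are scored, neither of which the abstract specifies. The following is a minimal Python sketch of one commonly used evaluation, not the dissertation's own protocol. It assumes a contact means two residues whose representative atoms lie closer than 8 Angstroms, a minimum sequence separation of 6, and scoring of the top L/k predictions for a protein of length L. The function names, the toy hairpin coordinates and the toy scores are all hypothetical.

```python
from math import dist


def true_contacts(coords, threshold=8.0, min_sep=6):
    """Return the set of residue pairs (i, j), i < j, that are in contact.

    Assumption: one 3D point per residue (for example its C-beta atom),
    a distance cutoff of 8 Angstroms, and a minimum sequence separation
    of 6. These are common conventions, not values from the abstract.
    """
    n = len(coords)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + min_sep, n)
        if dist(coords[i], coords[j]) < threshold
    }


def top_k_precision(scores, contacts, length, divisor=1, min_sep=6):
    """Precision of the top L/divisor highest-scoring predicted pairs.

    `scores` maps (i, j) pairs with i < j to a predicted contact score.
    If fewer than L/divisor eligible pairs were predicted, precision is
    computed over the pairs that are available.
    """
    eligible = [pair for pair in scores if pair[1] - pair[0] >= min_sep]
    ranked = sorted(eligible, key=lambda pair: scores[pair], reverse=True)
    top = ranked[: max(1, length // divisor)]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in contacts)
    return hits / len(top)


# Hypothetical 12-residue hairpin: two antiparallel strands 5 Angstroms
# apart, with 3.8 Angstroms between consecutive residues along a strand.
coords = [(3.8 * i, 0.0, 0.0) for i in range(6)]
coords += [(3.8 * (11 - i), 5.0, 0.0) for i in range(6, 12)]
contacts = true_contacts(coords)

# Hypothetical predicted scores; (3, 4) is ignored because the two
# residues are too close in sequence to count as a contact prediction.
scores = {(0, 11): 0.9, (1, 10): 0.8, (0, 6): 0.7, (2, 9): 0.6, (3, 4): 0.99}
precision_l4 = top_k_precision(scores, contacts, len(coords), divisor=4)
print(f"top-L/4 precision: {precision_l4:.3f}")
```

Scoring the same predictions at several cutoffs (for example L/5, L/2 and L) is one way to observe how a method behaves as the number of predicted contacts grows, which is the regime in which the abstract reports COMTOP's largest gains.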
Keywords/Search Tags: Residue contact/distance prediction, multiple sequence alignment, large-scale performance evaluation, mixed-integer linear programming, deep residual network