Research On Embedding Of Protein Tertiary Structure And Its Application In Protein Engineering

Posted on: 2022-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: J W Luo
GTID: 2481306569481344
Subject: Bio-engineering
Abstract:
Protein engineering has important applications in many fields, such as medicine, food, the chemical industry, and energy, and has the potential to accelerate the development of synthetic biology, medicine, and nanotechnology. Among traditional protein engineering methods, directed evolution requires large numbers of random mutations and screening rounds, which makes it inefficient, while rational design suffers from bottlenecks such as low accuracy because the intrinsic relationship between protein sequence, structure, and function is poorly understood. Protein databases are growing exponentially and contain evolutionary information accumulated over billions of years, but most of the proteins in them lack labels and annotations. Recent advances in artificial intelligence provide a new technical framework for understanding the intrinsic relationship between protein sequence, structure, and function, and artificial intelligence-based protein engineering will lead a new round of revolution in the life sciences.

Self-supervised learning, an artificial intelligence technique that can learn semantic information from unlabeled data, has been used in recent years to learn embeddings that capture biological characteristics from large numbers of unlabeled protein sequences. Such embeddings can reduce the cost of protein engineering tasks by two orders of magnitude. The functional properties of a protein are mainly determined by its tertiary structure, yet no research has been reported on using self-supervised learning to learn embeddings from large numbers of unlabeled protein tertiary structures. This research therefore aims to build a neural network language model with an "encoder-decoder" architecture to embed and encode the tertiary structure of proteins. The training process of the model is divided into three steps: 1) each residue in the tertiary structure of the protein is represented by the relevant characteristics of its neighboring residues in space and used as model input; 2) in the encoder, the context information of the protein sequence is extracted by a bidirectional long short-term memory network; 3) in the decoder, the context information of each residue is used to predict the type of that residue, so that the protein tertiary structure embedding (PtsRep) is learned in a self-supervised manner.

The clustering performance of PtsRep in the protein structure classification task is 9.2 times higher than that of its original input, which indicates that it captures the deep biological semantic information contained in the protein structure. To further evaluate its application performance, this research applied PtsRep to two protein engineering tasks: protein stability prediction and avGFP fluorescence prediction. The results show that PtsRep is superior to two leading benchmark embeddings in this field, TAPE-BERT and UniRep, while using only 0.12% of the training data and 2.7% of the network parameters of the TAPE-BERT model. In the protein stability prediction task, using the same data set, PtsRep improves the Spearman correlation coefficient by 27.4% compared with the previous best model. In the avGFP fluorescence prediction task, simulation experiments show that PtsRep needs to test only 28 mutant sequences to identify the brightest mutant among 25,517 avGFP mutants, which reduces the testing budget by 60% compared with the previous best model.

Building on this work, this research further integrates the three models PtsRep, TAPE-BERT, and UniRep with weights. In the task of identifying the brightest avGFP mutant, the ensemble model reduces the testing budget by 25% compared with the best sub-model. In addition, this research developed a Web application that predicts the fluorescence of avGFP mutants based on the ensemble model.

In summary, this research realized the embedding of protein tertiary structure based on self-supervised learning, which performed better than previous methods on protein engineering tasks, and further improved the model's ability to identify target mutants by integrating multiple embedding models. This research provides a new way to explore the relationship between mutants and functions in protein engineering and a new perspective for other protein-related research fields.
Keywords: Protein tertiary structure, Artificial intelligence, Self-supervised learning, Embedding, Protein engineering