
Study Of Protein Representation Learning Methods In Protein-Ligand Affinity Prediction

Posted on: 2024-09-20
Degree: Master
Type: Thesis
Country: China
Candidate: R Q Lu
Full Text: PDF
GTID: 2530307079493804
Subject: Chemistry
Abstract/Summary:
Binding of small-molecule drugs to their targets is necessary for efficacy, and an important task in structure-based drug design is finding molecules that can bind a given target. Traditional structure-based computer-aided drug design relies mainly on virtual screening with docking scoring functions, which suffer from high computational cost and poor predictive performance. In recent years, deep learning has made significant progress in natural science fields such as biology and chemistry; for example, AlphaFold2 achieved a breakthrough in protein structure prediction. In computer-aided drug design, recent deep learning models have outperformed traditional quantitative structure-activity relationship methods in predicting molecular properties such as ADMET. The affinity between a protein and a small molecule is influenced by many complex factors, and traditional scoring functions based on empirical terms, knowledge, or force fields cannot capture all of these complex nonlinear effects. Deep learning models provide powerful representation learning capabilities and can model complex nonlinear relationships between input and output. The diversity of amino acid sequences and the complexity of the three-dimensional structures that make up proteins make feature engineering and representation learning difficult, and effective representation learning methods are the foundation of downstream tasks. In this thesis, we investigate representation learning methods for three protein data structures (one-dimensional amino acid sequences, three-dimensional surface point clouds, and three-dimensional atom graphs), focusing on the task of predicting protein-ligand affinity.

For the one-dimensional amino acid sequence of a protein, we use BERT, a self-supervised pre-training method from natural language processing, to learn sequence representations. Self-supervised pre-training can learn the intrinsic features and distribution of large amounts of unlabeled protein sequence data, and the abstract features extracted by the pre-trained model can be transferred to downstream tasks. For the ligand, we use MolGNet, a molecular-graph self-supervised pre-training model based on pairwise subgraph discrimination, to extract molecular features. Finally, the two types of features are fused through a Transformer block and used for affinity prediction. We tested the virtual screening performance of this model, and the results show that the one-dimensional amino acid sequence features extracted by the self-supervised pre-trained model achieve better prediction and screening performance than the model without pre-training.

For three-dimensional structure representation, we first sample a protein surface point cloud from the three-dimensional structure. PointNet and PointNet++, representation learning methods designed for point cloud data, are used to learn representations of the three-dimensional protein point cloud. We use TrimNet, a message-passing graph neural network, to extract molecular features, and a Transformer block to fuse the protein point cloud features and molecular graph features for prediction. Experiments show that PointNet and PointNet++ can effectively learn representations of protein surface point cloud data. Ablation experiments demonstrate that the coordinates in the point cloud data significantly improve the model's predictive performance, indicating that the model can learn, from the coordinates, three-dimensional geometric information relevant to protein-molecule affinity.

Because the feature engineering involved in sampling the protein surface point cloud and computing physical and chemical properties at the sampled points may lose input information, our final work takes a 3D graph with atoms as nodes as input and uses an equivariant graph neural network, based on group-equivariance theory, to learn representations of atom features in the protein pocket. We use TrimNet to learn representations of atom node features in the molecular graph. In this work, we propose for the first time the use of pair-interaction supervision, inspired by atomic pairwise interactions, together with knowledge distillation to train the model. With these methods, we achieve better performance than existing models on affinity prediction and virtual screening tasks, and we demonstrate their effectiveness through ablation experiments and visualization analysis.

In summary, this thesis investigated three types of protein input and their corresponding representation learning methods for the task of protein-ligand affinity prediction, achieving better accuracy than other methods of the same type on affinity prediction and virtual screening tasks. This work provides new solutions for deep-learning-based protein-ligand affinity prediction and offers more effective deep learning tools for structure-based virtual screening.
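The pair-interaction idea in the final work, scoring every protein-atom/ligand-atom pair and aggregating the contributions, can be illustrated with a toy bilinear scoring function. The function name, the shapes, and the bilinear form are illustrative assumptions for this sketch, not the thesis's actual model:

```python
import numpy as np

def pairwise_affinity(prot_feats, lig_feats, w):
    """Toy pair-interaction scoring: each protein-atom / ligand-atom
    feature pair contributes a learned bilinear term, and the predicted
    affinity is the sum over all pairs. The per-pair score matrix can
    also serve as an auxiliary supervision target."""
    # prot_feats: (Np, d) pocket-atom embeddings
    # lig_feats:  (Nl, d) ligand-atom embeddings
    # w:          (d, d)  learned interaction weights
    pair_scores = prot_feats @ w @ lig_feats.T   # (Np, Nl) per-pair terms
    return pair_scores, pair_scores.sum()

# illustrative call with random embeddings
p = np.random.randn(5, 8)   # 5 pocket atoms
l = np.random.randn(3, 8)   # 3 ligand atoms
scores, affinity = pairwise_affinity(p, l, np.eye(8))
print(scores.shape)  # (5, 3)
```

Supervising the per-pair matrix (rather than only the pooled sum) gives the model an atom-level training signal, which is the intuition behind pair-interaction supervision.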
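The knowledge-distillation training mentioned for the final work can likewise be sketched as a weighted regression objective in which a student model matches both the experimental labels and a teacher's predictions. The function name, the MSE form, and the `alpha` weighting are assumptions for illustration, not the thesis's exact loss:

```python
import numpy as np

def distill_loss(student_pred, teacher_pred, labels, alpha=0.5):
    """Toy distillation objective for affinity regression:
    a hard term against experimental labels plus a soft term
    pulling the student toward the teacher's predictions."""
    hard = np.mean((student_pred - labels) ** 2)        # supervised term
    soft = np.mean((student_pred - teacher_pred) ** 2)  # distillation term
    return alpha * hard + (1.0 - alpha) * soft
```

When the teacher is a stronger or larger model, the soft term transfers its learned input-output mapping to the student even on examples where the label alone is a weak signal.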
Keywords/Search Tags:Protein-Ligand Affinity Prediction, Protein Representation Learning, Deep Learning, Virtual Screening