| Proteins are fundamental substances in living organisms and carry out diverse biological functions. They act either alone or through protein-protein interactions (PPIs) with other proteins to accomplish biological processes. PPIs can significantly affect how living organisms function and are fundamental to understanding biological processes and revealing disease mechanisms from a systems perspective. The study of PPIs is therefore an important direction in biological and medical research, and accurate PPI prediction has far-reaching research significance. In vivo and in vitro experiments are time-consuming and labor-intensive, yet they have identified a large amount of PPI data over the past decades; with the continuous development of deep learning, computational models have become the preferred choice in bioinformatics. Taking deep learning as its starting point, this paper designs PPI prediction models from the following three aspects. 1. Existing multi-type PPI prediction methods show significantly lower test accuracy when the dataset is partitioned by breadth-first or depth-first search, so this paper proposes GDP (Gcn Doc PPI), a multi-type PPI prediction method based on the Doc2vec text embedding method and graph convolutional networks (GCN). GDP consists of four parts: a protein embedding module, a feature extraction module, a graph convolutional encoding module and a classifier prediction module. The protein embedding module addresses the choice of initial protein features by adapting Doc2vec, an unsupervised paragraph vector learning model, to embed variable-length protein sequence information into a low-dimensional vector space; the feature extraction module stacks one-dimensional convolutional networks to further integrate the features produced by the protein embedding module and uses multiple convolutional kernels to amplify the features that are effective for multi-type PPI prediction. The 
graph convolutional encoding module exploits graph deep learning to make full use of the PPI network structure, aggregating the information of each protein's neighboring proteins to improve the encoded representation of protein nodes; the classifier prediction module locates protein interaction edges from the PPI network structure, combines the information of the two protein nodes on each edge, and learns from them to classify interactions more efficiently and accurately. Experiments on three real datasets of different sizes, SHS27k, SHS148k and STRING, show that GDP achieves the best results on the first two datasets, especially for PPIs between "new proteins" that do not appear in the training set. 2. GDP can overfit on large datasets such as STRING, so its prediction accuracy there falls short of the state of the art. To address this problem and further improve the accuracy of multi-type PPI prediction, this paper proposes PTGP, a multi-type PPI prediction method based on ProtTrans and GAT (Graph Attention Network). PTGP consists of four parts: a sequence-based feature extraction module, a CNN (Convolutional Neural Network)-based feature extraction module, a graph neural network-based feature aggregation module and a splicing prediction module. The sequence-based feature extraction module uses ProtTrans, a pre-trained protein large language model, to embed protein sequences and obtain preliminary features, which are highly expressive in a low-dimensional space and greatly simplify the downstream model; the CNN-based feature extraction module uses a one-dimensional CNN to capture local features of protein sequences; the graph neural network-based feature aggregation module uses a multi-layer stacked GAT network, in which the feature vector of each protein is combined with the feature vectors of its neighbors in a weighted sum to obtain a protein representation that incorporates multidimensional 
features, taking the surrounding proteins into account while learning how strongly each of them influences the protein itself. The splicing prediction module takes the feature vectors of a pair of proteins obtained through the above steps, concatenates them, and feeds them into a multi-class classifier to predict the type of the PPI pair. Experimental results show that, compared with various baselines and with the GDP method proposed in this paper, the model achieves the highest accuracy on three real datasets of different sizes, including for PPIs between "new proteins" that do not appear in the training set. 3. PTGP demonstrates the importance of pre-trained protein large language models for PPI prediction tasks, so this paper turns to another PPI prediction problem, binary PPI prediction, and proposes ELP (Esm2Lstm PPI), a binary PPI prediction method based on ESM2 and LSTM. ELP consists of four parts: an ESM2 embedding module, a CNN module, an LSTM module and a binary prediction module. The ESM2 embedding module uses ESM2, a pre-trained protein large language model, to embed protein sequences and obtain initial features; the CNN module further extracts features from the protein embedding vectors produced by the ESM2 module; the LSTM module, by virtue of its ability to handle long-range dependencies in sequences, lets each amino acid observe the feature information of amino acids across the whole sequence, which is finally integrated into protein-level feature information; the binary prediction module concatenates the resulting protein features and uses them as input to a fully connected neural network to perform binary PPI prediction. Experimental results on the yeast dataset show that ELP achieves the best performance on all metrics. |
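
The neighbor aggregation and splicing steps described for PTGP can be illustrated with a minimal sketch. It is not the implementation used in this work: plain dot-product scores stand in for the learned attention of a real GAT layer, and the function names, protein identifiers and toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate(features, neighbors, node):
    """Weighted sum of a protein's own vector and its neighbors' vectors."""
    # The node itself is included so that its own features are retained.
    group = [node] + list(neighbors[node])
    # Assumption: dot-product similarity replaces GAT's learned attention scores.
    weights = softmax([dot(features[node], features[j]) for j in group])
    dim = len(features[node])
    return [sum(w * features[j][d] for w, j in zip(weights, group))
            for d in range(dim)]


def pair_vector(features, neighbors, a, b):
    """Splice (concatenate) the aggregated vectors of two proteins."""
    return aggregate(features, neighbors, a) + aggregate(features, neighbors, b)


# Toy PPI graph: three hypothetical proteins with 2-dimensional features.
features = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}
neighbors = {"P1": ["P2", "P3"], "P2": ["P1"], "P3": ["P1"]}

pair = pair_vector(features, neighbors, "P1", "P2")
print(pair)
```

In a full model the concatenated vector would be passed to a trained classifier that outputs the interaction type; the sketch stops at the spliced pair representation.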