Protein-protein Interaction Sites Prediction Based On Natural Language Processing

Posted on: 2023-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: N Wang
GTID: 2530307103481384
Subject: Applied statistics
Abstract/Summary:
Predicting protein-protein interaction sites is crucial for practical applications such as drug development. However, identifying these sites experimentally is time-consuming and laborious, so computational prediction is becoming increasingly popular. With the rise of natural language processing, and given the structural similarity between protein sequences and text sequences, some studies have applied natural language processing techniques to protein-protein interaction site prediction. Inspired by this literature, this thesis studies how classic text classification and relation classification models perform on protein-protein interaction site prediction. It also adopts an attention-based feature fusion mechanism that aims to improve classification accuracy by using key information without losing background information.

First, each site is represented by a word embedding vector. Following previous research, global features and local features are extracted for the site currently being predicted. Second, this thesis proposes several embedding processing methods for the extracted features: dense embedding of the one-hot vectors of the global features, average pooling of the physical and chemical properties of the global features, weighting and attention calculation for the local features, and a multi-head cross-attention mechanism that fuses the global and local features. Third, the processed features are fed into a Textcnn module, a Textrnn module, a Textrcnn module and a Textrnn-Attention module, respectively, to extract further information; the Textrcnn module is partially modified by combining a bidirectional long short-term memory (BiLSTM) network with a convolutional neural network. Finally, the extracted information is passed to the classification block.

Combining the different embedding processing methods with the different information extraction methods yields 16 models. In addition, the number of heads in the multi-head cross-attention layer, one of the hyper-parameters, is tuned for the ELCAI-Textcnn model. Three models, namely EL-TextrnnAttention, ELCA-Textcnn and ELCAI-Textcnn, are then selected for comparison with other methods on the same test set. Their accuracies are 0.659, 0.678 and 0.666, and their AUPR values are 0.330, 0.331 and 0.329, respectively. Compared with DeepPPISP, accuracy improves by 0.611%, 3.511% and 1.679%, and AUPR improves by 3.125%, 3.323% and 2.813%, respectively. These results improve the performance of protein-protein interaction site prediction to some extent and provide an alternative model for drug development and other practical applications. They also demonstrate that natural language processing techniques such as text classification and relation classification models, combined with the feature fusion mechanism, can feasibly be transferred to the protein-protein interaction site prediction task, which provides a theoretical reference for subsequent research.
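
The multi-head cross-attention fusion of global and local features mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the thesis's implementation: the abstract does not say which feature set supplies the queries, so the sketch assumes queries come from the local features and keys and values from the global features. The function name, the feature dimensions, the random projection matrices (standing in for learned weights) and the residual connection are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(local_feats, global_feats, num_heads, rng):
    """Fuse local features (queries) with global features (keys/values).

    local_feats:  array of shape (n_local, d)
    global_feats: array of shape (n_global, d)
    Returns the fused features (n_local, d) and the attention
    weights (num_heads, n_local, n_global).
    """
    d = local_feats.shape[1]
    assert d % num_heads == 0, "d must be divisible by num_heads"
    d_head = d // num_heads

    # Assumption: random projections stand in for learned weight matrices.
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    def split_heads(x):
        # (n, d) -> (num_heads, n, d_head)
        return x.reshape(-1, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(local_feats @ w_q)
    k = split_heads(global_feats @ w_k)
    v = split_heads(global_feats @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, n_local, d_head)

    # Concatenate the heads and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(-1, d)

    # Assumption: a residual connection so the local signal is retained.
    fused = merged @ w_o + local_feats
    return fused, weights


rng = np.random.default_rng(0)
local_feats = rng.standard_normal((7, 16))    # e.g. a 7-residue window
global_feats = rng.standard_normal((50, 16))  # e.g. a 50-residue sequence
fused, weights = multi_head_cross_attention(
    local_feats, global_feats, num_heads=4, rng=rng
)
print(fused.shape, weights.shape)
```

In this sketch each head produces its own distribution over the global positions for every local position, so changing `num_heads` (the hyper-parameter the abstract says was tuned for ELCAI-Textcnn) changes how many such distributions are combined while the fused output keeps the shape of the local features.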
Keywords/Search Tags: protein-protein interaction site prediction, natural language processing, feature fusion techniques, multi-head cross-attention mechanism