Weakly Supervised Protein-protein Interaction Identification Based On Complex Network And Graph Embedding Representation

Posted on: 2020-10-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Mao
GTID: 2370330590972674
Subject: Computer Science and Technology
Abstract/Summary:
Protein-protein interaction (PPI) identification is an important research direction in biomedicine, and knowledge of PPIs is of great significance for drug discovery and disease diagnosis. Known PPI relations are currently recorded mainly in the literature. With the rapid growth of the medical literature, finding PPI relations by hand has become difficult for researchers, so automatically identifying protein-protein interactions from text has become an important research topic. Commonly used PPI recognition algorithms are based on supervised learning. Although such methods can achieve good results, they require a large amount of labeled data, which makes them difficult to apply in practice. This thesis therefore proposes a PPI identification method based on weakly supervised learning. First, a professional database is used to collect the target protein pairs and all sentences containing them to construct the corpus, and a small number of interacting protein pairs are used as the seed set. Then, features that express the textual relationship are extracted from each sentence as a lexical pattern, and each lexical pattern is represented as a vector according to the distributional hypothesis. Next, lexical patterns from the corpus that are similar to the seed lexical patterns are used to construct the candidate set. Finally, the candidate set is evaluated, and protein pairs scoring above a threshold are added to the seed set. This process is iterated, and interactions are recognized through the repeated expansion of the seed set. The method needs only a small amount of labeled data to achieve good results, with an F-score of up to 67.35%. However, the weakly supervised method may introduce noisy protein pairs unrelated to the seed set in each iteration, which is the semantic drift problem. To address this, a complex network model is used to further evaluate the candidate set, which reduces the noise in each iteration and alleviates semantic drift. Compared with the basic weakly supervised model, this method clearly improves accuracy and also improves the F-score, which reaches 68.14% at its highest. Finally, this thesis proposes generating lexical pattern vectors with a graph embedding method. This method combines the word information contained in the traditional one-hot representation with the semantic relationship information contained in the representation based on the distributional hypothesis, yielding a better representation. The experimental results show that the new representation improves the precision, recall, and F-score of the PPI identification algorithm. At the highest F-score, the three values are 70.96%, 71.00%, and 70.98%, respectively, a clear improvement in model performance.
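The iterative seed-set expansion described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names `cosine` and `bootstrap`, the toy three-dimensional pattern vectors, the scoring rule (best cosine similarity to any seed pattern), and the threshold value are all assumptions made for illustration, and the complex-network filtering and graph-embedding steps are not shown.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bootstrap(pattern_vectors, seeds, threshold=0.85, max_iters=10):
    """Grow a seed set of protein pairs by pattern similarity (hypothetical helper).

    pattern_vectors: dict mapping a protein pair to the list of lexical-pattern
                     vectors extracted from sentences mentioning that pair.
    seeds:           initial set of pairs known to interact.
    """
    seeds = set(seeds)
    for _ in range(max_iters):
        # Collect the pattern vectors of every pair currently in the seed set.
        seed_patterns = [v for pair in seeds for v in pattern_vectors.get(pair, [])]
        new_pairs = set()
        for pair, vecs in pattern_vectors.items():
            if pair in seeds:
                continue
            # Assumed scoring rule: best similarity to any seed pattern.
            score = max(
                (cosine(v, s) for v in vecs for s in seed_patterns),
                default=0.0,
            )
            if score >= threshold:
                new_pairs.add(pair)
        if not new_pairs:
            break  # nothing passed the threshold, so the seed set has converged
        seeds |= new_pairs
    return seeds


# Toy data: one pattern vector per protein pair (values are made up).
toy = {
    ("P1", "P2"): [[1.0, 0.0, 0.0]],
    ("P3", "P4"): [[0.9, 0.1, 0.0]],
    ("P5", "P6"): [[0.7, 0.5, 0.0]],
    ("P7", "P8"): [[0.0, 0.0, 1.0]],
}
result = bootstrap(toy, {("P1", "P2")}, threshold=0.85)
print(sorted(result))
```

In this toy run, the pair (P5, P6) does not pass the threshold against the original seed alone and is admitted only after (P3, P4) has joined the seed set. The same mechanism that lets the seed set grow also lets unrelated pairs creep in over successive iterations, which is the semantic drift that the abstract's complex-network evaluation is meant to reduce.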
Keywords/Search Tags: protein-protein interaction, weakly supervised learning, lexical pattern, key word, complex network, graph embedding