Extraction of Protein-Protein Interaction Based on Distant Supervision

Posted on: 2019-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: Q K Min
Full Text: PDF
GTID: 2370330596951108
Subject: Engineering
Abstract/Summary:
Protein-protein interaction (PPI) is one of the most important research topics in biology and medicine, and it is of great significance for the treatment of diseases and the development of new drugs. Information about PPIs obtained through biomedical experiments is stored mainly in papers as unstructured text. With the rapid growth of the biomedical literature, collecting PPI information manually can no longer meet the requirements of practical applications. Automatically extracting PPI relations from the biomedical literature has therefore become an essential research topic in bioinformatics. Distant supervision, which gathers large-scale training data by aligning a knowledge base with unstructured text and thereby reduces dependence on manually annotated datasets, is currently the most common approach to PPI extraction. However, this alignment introduces a large amount of noise into the training data, which greatly degrades the extraction performance of the model.

To address this problem, we first build a baseline PPI classification model that is trained on distantly supervised data and tested on manually annotated data, and we analyze the noise present in the training data. We then build a PPI extraction model based on topic collections. The topic collection corresponding to a protein pair is extracted using keywords and sentence similarity based on cross prediction, and sentences outside the topic collection are treated as noise and removed from the training data. The model is trained on the cleaned data and tested on manually annotated corpora. We evaluate it under different parameter combinations and compare it with the baseline distantly supervised model. The experimental results show that the F1 measures for interactive and non-interactive protein pairs increase by 1.49% and 9.18%, respectively, which indicates that the model is effective at removing noise from the training data.

In addition, to make the most of the relationships between sentence classifications, we introduce a multi-instance multi-label learning model for PPI extraction, which jointly models all the sentences of a protein pair and all their labels with latent variables. The expectation-maximization algorithm is used to classify the sentences and iteratively remove the noisy ones. The experimental results show that the iterative algorithm based on multi-instance multi-label learning identifies noise more accurately. Compared with the baseline distantly supervised model, the F1 measure for interactive protein pairs increases slightly, while the F1 measure for non-interactive protein pairs increases by 14.84%. The performance of the model is improved and the results are more balanced.
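
The distant supervision step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy knowledge base `KNOWN_INTERACTIONS`, the function name `distant_labels`, and the exact-token matching of protein names are all assumptions made for illustration. Each sentence that mentions two proteins is labelled positive if the pair appears in the knowledge base and negative otherwise.

```python
from itertools import combinations

# Hypothetical toy knowledge base of interacting protein pairs (order-free).
KNOWN_INTERACTIONS = {
    frozenset(("BRCA1", "BARD1")),
    frozenset(("TP53", "MDM2")),
}


def distant_labels(sentences, proteins, kb=KNOWN_INTERACTIONS):
    """Label each co-occurring protein pair in each sentence using the KB.

    Returns a list of (sentence, protein_a, protein_b, label) tuples,
    where label is 1 if the pair is in the KB and 0 otherwise.
    """
    examples = []
    for sentence in sentences:
        # Naive tokenisation (assumption): split on whitespace after
        # stripping commas and full stops.
        tokens = set(sentence.replace(",", " ").replace(".", " ").split())
        mentioned = sorted(p for p in proteins if p in tokens)
        for a, b in combinations(mentioned, 2):
            label = 1 if frozenset((a, b)) in kb else 0
            examples.append((sentence, a, b, label))
    return examples


sentences = [
    "BRCA1 binds BARD1 in the nucleus.",
    "TP53 and MDM2 were both measured in the assay.",
    "BRCA1 and TP53 are tumour suppressors.",
]
for example in distant_labels(sentences, {"BRCA1", "BARD1", "TP53", "MDM2"}):
    print(example)
```

In this toy input, the second sentence receives a positive label only because the pair is in the knowledge base, although the sentence itself does not state an interaction. This is the kind of noisy training instance that the topic-collection and multi-instance multi-label models in the abstract aim to identify and remove.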
Keywords/Search Tags: Protein-Protein Interaction, Distant Supervision, Noise Data, Topic Collection, Multi-Instance Multi-Label, Expectation Maximization Algorithm