
Study On Denoising Algorithm For Distant Supervision Relation Extraction

Posted on: 2022-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: H Ding
GTID: 2518306605968589
Subject: Master of Engineering
Abstract/Summary:
In recent years, knowledge has continually driven the development of artificial intelligence and led scientific and technological progress. How to discover knowledge effectively has therefore become a challenging problem. As an important branch of information extraction, relation extraction aims to extract structured knowledge from unstructured text in order to express the semantic relation between two entities. Relation extraction based on distant supervision is currently a widely used method. It automatically labels a large-scale corpus by using an external knowledge base as the supervision source. Although distant supervision yields a large amount of labeled data, it inevitably introduces noise. At present, most distant supervision relation extraction methods are improved by constructing complex models, introducing external knowledge, or optimizing training strategies. However, the noise problem in distant supervision for relation extraction has not yet been solved well.

To denoise distant supervision for relation extraction, this thesis presents an Iterative Denoising Method with Pattern Fusion Network, named ID+PFN. First, a small amount of positive data is obtained by filtering with patterns found by unsupervised methods. Then, as much of the remaining positive data as possible is found through training with a greedy strategy. In this way, noise is removed from the distantly supervised data and the purpose of denoising is achieved. At the same time, a relation extraction network with pattern fusion is designed to strengthen the model’s attention to the words that reflect the relation semantics and to further improve its denoising ability. The main contents of this thesis are as follows:

(1) To denoise distant supervision relation extraction, the ID+PFN method is proposed. First, a method for discovering a small amount of positive data is designed. Although the noisy data is chaotic, the patterns that express the appropriate semantic relation are similar in their textual features. This thesis therefore uses frequency sorting and k-means clustering to obtain patterns that express the appropriate semantic relation in distantly supervised data. The final relation patterns are taken from the intersection of the two results. A small amount of positive data is discovered by filtering with these patterns. Then, an iterative training framework based on a greedy strategy is designed. A small amount of positive data can be found through unsupervised methods, but using only this data wastes the other positive data in the distantly supervised set. To improve the utilization of positive data, this thesis designs a training method based on a greedy strategy. After training on the small set of positive data, the model can extract relations from sentences with clear semantics. The denoising problem is then treated as a search for the optimal set of positive data within the distantly supervised data: in each iteration the remaining data is scored, and the high-scoring data is selected and added to the next training set through the feedback of the scoring mechanism. In this way, as much positive data as possible is included in training, so that the noisy data in the distant supervision is discarded and the purpose of denoising is achieved. Finally, a relation extraction network with pattern fusion is designed. To address the insufficient attention paid to patterns in existing methods, this thesis proposes a relation extraction network with pattern fusion, which incorporates the important property that patterns can explain the relation semantics of a text. On the basis of a bidirectional long short-term memory neural network, a pattern-based attention method is designed to guide the model to attend to the pattern. During training, the loss of the attention mechanism and the loss of the relation extraction model are optimized jointly to complete the fusion of the pattern and further improve the denoising ability.

(2) F1-score, P-R curves, and other indicators are used to evaluate the algorithm on the standard NYT data set, and the results confirm the effectiveness of the denoising algorithm. In comparison experiments with PCNN+ATT, CNN+RL, and other state-of-the-art models, the improvement in F1-score is about 2.36%, which shows that the proposed algorithm can handle the noise problem robustly and effectively reduce the impact of noisy data in distant supervision for relation extraction.
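
The iterative procedure described in (1) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the fixed score threshold, the stopping rule, and the word-overlap scorer (a toy stand-in for the BiLSTM network with pattern attention) are all assumptions made only for illustration. The sketch seeds the training set by pattern filtering, then repeatedly retrains, scores the remaining sentences, and adds the high-scoring ones.

```python
from typing import Callable, List, Sequence, Set

Scorer = Callable[[str], float]


def seed_by_patterns(sentences: Sequence[str], patterns: Sequence[str]) -> Set[int]:
    """Indices of sentences that contain at least one relation pattern."""
    return {i for i, s in enumerate(sentences) if any(p in s for p in patterns)}


def train_overlap_scorer(positives: List[str]) -> Scorer:
    """Toy 'model' (assumption): scores a sentence by the fraction of its
    words already seen in the current positive training set."""
    vocab = {w for s in positives for w in s.lower().split()}

    def score(sentence: str) -> float:
        words = sentence.lower().split()
        return sum(w in vocab for w in words) / len(words) if words else 0.0

    return score


def iterative_denoise(
    sentences: Sequence[str],
    patterns: Sequence[str],
    train: Callable[[List[str]], Scorer],
    threshold: float = 0.7,  # hypothetical cut-off for "high-scoring"
    max_rounds: int = 5,     # hypothetical iteration limit
) -> Set[int]:
    """Greedy selection: seed with pattern matches, then grow the training
    set with high-scoring sentences until no new sentence qualifies."""
    selected = seed_by_patterns(sentences, patterns)
    for _ in range(max_rounds):
        scorer = train([sentences[i] for i in sorted(selected)])
        newly = {
            i
            for i, s in enumerate(sentences)
            if i not in selected and scorer(s) >= threshold
        }
        if not newly:
            break
        selected |= newly
    return selected


if __name__ == "__main__":
    corpus = [
        "jobs was the founder of apple",
        "gates was the founder of microsoft",
        "jobs was the founder and chief of apple",
        "jobs ate an apple yesterday",
    ]
    kept = iterative_denoise(corpus, ["founder of"], train_overlap_scorer)
    for i in sorted(kept):
        print(corpus[i])
```

In this toy corpus, all distantly labeled with the same entity pair relation, the first two sentences are seeded by the pattern, the third is picked up in the first iteration, and the fourth is left out as noise. Sentences that are never selected are simply excluded from training, which is how the sketch mirrors the idea of discarding noisy data.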
Keywords/Search Tags: distant supervision, denoise, relation extraction, information extraction, attention mechanism, deep learning