
Research On Ultra-fine Entity Typing Based On Self-training Denoise And Feature Fusion

Posted on: 2022-05-08; Degree: Master; Type: Thesis
Country: China; Candidate: Y F Hu
GTID: 2518306551953479; Subject: Computer technology
Abstract/Summary:
The ultra-fine entity typing (UFET, also known as extremely fine-grained entity typing) task aims to classify the named entities in a text. It is a crucial task in information extraction: finer classification provides richer semantic information, and the technology offers essential support for downstream tasks such as information retrieval, question answering, knowledge graphs, and text mining. UFET must eliminate semantic ambiguity and automatically and quickly find appropriate, accurate fine-grained categories for an entity according to its context. Because UFET has many categories (more than a thousand), crowd labeling is extremely difficult, which forces researchers to use distant supervision data, obtained from the Internet by rules, as training data. However, distant supervision data sets generally contain more than 20% noise, which is the bottleneck of current UFET development. This thesis therefore proposes to remove the noisy data in ultra-fine-grained data sets through self-training. In addition, as the number of categories and the amount of data increase, UFET depends more and more on feature extractors. Feature extraction is the first step in natural language processing and is another weak point of UFET. Obtaining a feature extractor with excellent performance costs researchers a great deal of money and time. Given this constraint, this thesis proposes fusing existing feature extractors to obtain a feature extractor that carries more semantic information. The UFET method based on self-training denoising and feature fusion proposed in this thesis is detailed as follows:

(1) To address the noise problem in UFET distant supervision data, this thesis proposes a UFET method based on self-training denoising. The method relies on the consistency between the global and local data distributions to purify distantly supervised data. Its steps are as follows: (a) divide the distant supervision data set into entity-disjoint sub-datasets; (b) model the local data distribution based on each sub-dataset; (c) model the global data distribution based on the statistics of the local distributions; (d) compare and evaluate the new and old global data distributions; (e) cut out distant supervision data sets with varying degrees of purification. A new model is then trained on the denoised distant supervision data set. The new model improves performance and avoids overfitting to the supervised data set. In addition, denoised data were randomly sampled for manual evaluation, and the self-training denoising method was transferred to a different data set for experiments. The results demonstrate that the method denoises stably and robustly, and that the self-training method, which uses no prior knowledge, is generally applicable.

(2) To address the problem that complicated feature extractors are difficult to optimize, this thesis proposes a UFET method based on feature fusion. The method offers a new approach to feature extraction for UFET that obtains more textual feature information using only a small amount of computational resources. Richer information is obtained from existing feature extractors by trying linear fusion and cascade fusion of different extractors. This thesis shows that cascade fusion performs better for feature extractors with lower similarity. In addition, by examining the relationship between model parameters and model performance, it identifies a feature fusion method suited to the complexity of current UFET models.

This thesis runs experiments on an in-domain public data set commonly used for UFET and on a cross-domain data set, and measures the results with widely accepted evaluation standards. The experimental results show that the proposed UFET algorithm outperforms the existing state-of-the-art method. Based on the proposed UFET method, we ranked first among domestic teams by overall score in the Entity Discovery and Linking (EDL) task of the Text Analysis Conference Knowledge Base Population (TAC-KBP), organized by the National Institute of Standards and Technology (NIST), in 2019. Other teams participating in this task included Tencent Artificial Intelligence Laboratory, Alibaba DAMO Academy, IBM Research, UIUC, CMU, and other universities and research institutions in China and abroad. In addition, in an application at the China Knowledge Center for Engineering Sciences and Technology (CKCEST), a UFET model was integrated into the annotation tool to assist annotators in labeling entity types. The cumulative labeling of 140 documents and 4030 entities shows that this assistance can double annotators' efficiency.
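Two of the operations named in the abstract, the entity-disjoint split and the two fusion styles, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the (mention, context, labels) tuple layout, the greedy balancing rule, the weight alpha, and the reading of "cascade fusion" as vector concatenation are all assumptions made for illustration.

```python
from collections import defaultdict
import math


def entity_disjoint_split(examples, k):
    """Partition (mention, context, labels) tuples into k subsets so that no
    mention string (case-insensitive) appears in more than one subset.
    The tuple layout and the balancing rule are illustrative assumptions."""
    by_entity = defaultdict(list)
    for ex in examples:
        by_entity[ex[0].lower()].append(ex)
    subsets = [[] for _ in range(k)]
    # Greedy balancing: place the largest entity groups first, each into the
    # subset that is currently smallest.
    ordered = sorted(by_entity.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, group in ordered:
        min(subsets, key=len).extend(group)
    return subsets


def linear_fusion(u, v, alpha=0.5):
    """Weighted sum of two feature vectors of equal dimension.
    alpha is a hypothetical mixing weight."""
    if len(u) != len(v):
        raise ValueError("linear fusion needs vectors of equal dimension")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(u, v)]


def cascade_fusion(u, v):
    """Concatenation of two feature vectors (assumed meaning of 'cascade')."""
    return list(u) + list(v)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Under these assumptions, concatenation keeps every dimension of both extractors, whereas a weighted sum mixes them into one vector of the original size. That is one plausible reading of why dissimilar extractors might favor cascade fusion, but the abstract itself does not give the mechanism.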
Keywords/Search Tags: Ultra-fine Entity Typing, Distant Supervision, Feature Fusion, Self-training, Data Denoise