
Research On Key Technologies Of Knowledge Extraction In Low-Resource Scenarios

Posted on: 2022-11-09    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z Z Li    Full Text: PDF
GTID: 1528307169976349    Subject: Computer Science and Technology
Abstract/Summary:
Knowledge extraction aims to automatically extract specific factual information from unstructured text to construct structured knowledge. It plays an important role in supporting downstream tasks such as text analysis, knowledge graph construction, and question answering systems. Knowledge extraction mainly includes two subtasks: named entity recognition and relation extraction. Deep learning-based models have achieved excellent performance on many benchmark data sets for both tasks, but this performance relies on large amounts of labeled data. In practice, high-quality labeled data are often difficult to obtain, especially in specialized fields such as medicine, finance, and technology, where manual labeling is time-consuming and laborious and labeled data easily become outdated; it is therefore difficult to provide sufficient labeled data for deep learning models. To address this issue, this thesis focuses on improving the performance of deep learning models for named entity recognition and relation extraction in low-resource scenarios, where only a small amount of labeled data is available. The main idea for addressing low-resource constraints is to use more information sources or to explore more efficient learning algorithms. For the named entity recognition and relation extraction tasks, this work studies three settings: semi-supervision, distant supervision, and few-shot learning, and conducts research from three aspects: data selection, model training, and knowledge transfer. The main contents and contributions are as follows.

Semi-supervised named entity recognition focuses on using a large amount of unlabeled data to improve a named entity recognition model when only limited labeled data are available. Existing methods based on the low-density separation assumption assign pseudo-labels to unlabeled data to provide new training data, but errors in the pseudo-labels seriously damage the performance of the named entity recognition model. From the perspective of data selection, this work builds a scoring model to evaluate the confidence of pseudo-labels and proposes a clause screening strategy that filters noisy pseudo-labels to generate new pseudo-labeled data. Through an iterative process of improving the classification model with the pseudo-labeled data and then generating higher-quality pseudo-labeled data with the improved model, this work achieves the goal of using unlabeled data to improve the named entity recognition model. The proposed method consistently improves deep learning-based models on two classical English data sets and one Chinese medical data set.

Distantly supervised relation classification focuses on using a large amount of automatically generated, noisily labeled data to train robust neural relation classification models. Existing methods rely only on the noisy data to mine strong patterns to guide instance selection, and they are therefore prone to unavoidable incorrect patterns. From the perspective of model training, this work introduces a small amount of manually labeled reference data to guide meta instance reweighting for robust neural model training. The instance weights are automatically adjusted through a meta-learning algorithm to minimize the loss on the reference data; when the reference data are insufficient, however, the meta instance reweighting algorithm suffers from model collapse. To handle this problem, this work proposes selecting highly reliable instances (elite data) from the noisy data to augment the reference data, thus enhancing meta instance reweighting. On two distantly supervised data sets, applying the proposed method to two basic neural network models significantly improves their performance, outperforming existing state-of-the-art methods. This indicates the effectiveness of the meta instance reweighting algorithm for training robust neural network models on noisy data sets. More detailed experiments show that the combination of reference data and elite data further enhances the meta instance reweighting algorithm.

Few-shot relation classification focuses on training a classification model that can distinguish newly emerged relations from only a few labeled instances. A few labeled instances may not provide enough features to distinguish a relation. In contrast, the definition text of a relation (such as the relation name or a descriptive text) can provide essential semantic information about the relation, but it is difficult to integrate effectively into the relation classification task. From the perspective of knowledge transfer, this work proposes pre-training a prototype encoder that transfers the knowledge in the relation definition text to the prototype representation used for relation classification, thus providing prior knowledge for few-shot relation classification. The prototype encoder encodes relation definition texts into vector representations, based on which the relation classification model assigns each instance to the relation with the closest representation. This work combines an instance encoder and a prototype encoder into a multi-instance relation classification model and trains it on a large-scale distantly supervised relation classification data set. After this joint training, the pre-trained prototype encoder successfully maps the semantics of a piece of relation definition text to its prototype vector. Applying the prior knowledge provided by the general-purpose prototype encoder to the relation classification model, the proposed method significantly improves on the benchmark model on FewRel 1.0 and the FewRel 2.0 domain-adaptation data set, and achieves state-of-the-art performance on FewRel 1.0. Experimental results on the NYT-10 and PubMed-25 data sets show that the pre-trained instance encoder and prototype encoder are effective for zero-shot relation classification, and they also show promising continual relation classification performance with a few labeled instances.

Across the three low-resource settings for knowledge extraction, this thesis proposes three types of model-agnostic solutions from the aspects of selecting pseudo-labeled data, training robust neural network models, and transferring knowledge from other sources. Extensive experiments under the corresponding settings verify that these solutions can effectively use unlabeled data to enhance named entity recognition models, use distantly supervised data to train robust neural relation classification models, and use prior knowledge to enhance few-shot relation learning. In the future, the author will try to extend these methods to other natural language processing tasks in similar low-resource scenarios.
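The nearest-prototype classification rule underlying the few-shot approach described above can be illustrated with a minimal sketch. This is not the dissertation's code: in the thesis, prototypes are produced by a pre-trained prototype encoder from relation definition text and instances by an instance encoder, whereas here both are stand-in toy vectors, and the relation names and function are hypothetical.

```python
# Illustrative sketch of nearest-prototype relation classification:
# each relation is represented by a prototype vector, and a query
# instance is assigned to the relation whose prototype is closest.
import numpy as np

def classify_by_prototype(instance_vec, prototypes):
    """Return the name of the relation whose prototype vector is
    nearest (Euclidean distance) to the instance embedding."""
    names = list(prototypes)
    dists = [np.linalg.norm(instance_vec - prototypes[n]) for n in names]
    return names[int(np.argmin(dists))]

# Toy 2-D example: two hypothetical relation prototypes and a query.
protos = {
    "founder_of": np.array([1.0, 0.0]),
    "born_in": np.array([0.0, 1.0]),
}
query = np.array([0.9, 0.2])
print(classify_by_prototype(query, protos))  # -> founder_of
```

Because classification reduces to a distance comparison against prototype vectors, new relations can be added at test time simply by encoding their definition texts into new prototypes, which is what makes the approach suitable for few-shot and zero-shot settings.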
Keywords/Search Tags:Named Entity Recognition, Relation Extraction, Low-Resource, Deep Learning, Distant Supervision, Meta Learning, Few-Shot Learning