Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that aims to recognize person names, locations, organizations, drug names, etc. in sentences. NER often serves as the first step of many NLP tasks, including question answering, information retrieval, coreference resolution, and topic modeling. Recently, deep neural networks have achieved remarkable results in NER and become the mainstream approach to this task. However, several problems remain: (1) deep neural methods for NER typically require a large amount of annotated data, which is laborious and expensive to obtain in practice; (2) most existing works assume clean annotations, while real-world scenarios typically involve a large amount of noise from a variety of sources, which can badly hurt model performance; (3) most works use pretrained word embeddings as the input layer, but these embeddings carry semantic information unrelated to NER and are therefore not perfectly suited to the task. To address these problems, the main contributions of this paper are as follows:

(1) To avoid laborious and expensive human annotation, we propose a method that effectively makes use of automatic annotations. Most automatically obtained annotations are partial annotations, in which only some of the entities are labeled; they therefore usually contain false negatives, i.e., unrecognized entities. To this end, we propose a training framework for relieving the harmful impact of false negatives. Specifically, we adopt a span-based classification method as our backbone model, which recognizes entities by classifying all subsequences of a sentence, and we design a pipeline that trains the model step by step. Experiments on two benchmark datasets show that our proposed framework outperforms state-of-the-art methods. We also show the effectiveness of our approach on a real-world task of Swedish biomedical NER in a practical
setup.

(2) To alleviate the harmful effect of noisy data on models, we propose a method based on confidence-score estimation. Based on empirical observations of noisy and clean labels, we design strategies for estimating confidence scores under local and global independence assumptions, and we partially marginalize out low-confidence labels with a CRF model. We further propose a calibration method for the confidence scores based on the structure of entity labels, and we integrate our approach into a self-training framework to boost performance. Experiments in general noisy settings across four languages, as well as in distantly labeled settings, demonstrate the effectiveness of our method.

(3) To reduce task-unrelated information in word embeddings, we propose a prism module that disentangles the semantic aspects of words and keeps only the semantics related to the task. In the prism module, some words are selectively replaced with task-related semantic aspects; these denoised word representations can then be fed into downstream tasks to make them easier. We also introduce a structure that trains this module jointly with the downstream model without additional data. Experiments show that our method significantly improves the performance of baselines on the NER task.
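The span-based backbone of contribution (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are ours, and the dictionary-based scorer stands in for a learned span classifier (e.g., a pooled encoder representation followed by a softmax).

```python
def enumerate_spans(tokens, max_len=6):
    """Enumerate all candidate spans (i, j), j exclusive, of length <= max_len."""
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            spans.append((i, j))
    return spans

def classify_spans(tokens, score_fn, max_len=6):
    """Label every candidate span with score_fn; keep non-'O' predictions.

    score_fn(span_tokens) -> label is a placeholder for a trained
    span classifier.
    """
    entities = []
    for i, j in enumerate_spans(tokens, max_len):
        label = score_fn(tokens[i:j])
        if label != "O":
            entities.append((i, j, label))
    return entities

# Toy dictionary-based "classifier", for illustration only.
toy = {("John", "Smith"): "PER", ("New", "York"): "LOC"}
tokens = ["John", "Smith", "visited", "New", "York"]
print(classify_spans(tokens, lambda s: toy.get(tuple(s), "O")))
# [(0, 2, 'PER'), (3, 5, 'LOC')]
```

Because every subsequence is scored independently, spans missing from a partial annotation can simply be treated as unlabeled rather than as negatives, which is what makes this backbone convenient for the step-by-step training pipeline described above.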
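The confidence-based handling of noisy labels in contribution (2) might look like the following sketch. The threshold value, function name, and the use of `None` to mark positions whose labels a partial-marginalization CRF should sum over are all assumptions made for illustration.

```python
import numpy as np

def mask_low_confidence(probs, labels, threshold=0.7):
    """Keep a (possibly noisy) label only when the model assigns it
    probability >= threshold; otherwise mark the position as None,
    signalling a partial-marginalization CRF to sum over all labels there.

    probs:  (seq_len, num_labels) array of per-token label probabilities
    labels: list of label indices from the (noisy) annotation
    """
    kept = []
    for t, y in enumerate(labels):
        kept.append(y if probs[t, y] >= threshold else None)
    return kept

probs = np.array([[0.9, 0.05, 0.05],
                  [0.4, 0.35, 0.25],
                  [0.1, 0.85, 0.05]])
print(mask_low_confidence(probs, [0, 1, 1]))
# [0, None, 1]
```

Here the middle token's annotated label gets only probability 0.35, so it is masked out instead of being trained on; the CRF then marginalizes over all possible labels at that position rather than trusting a potentially noisy one.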