
Research On Chinese Named Entity Recognition Based On Feature Enhancement

Posted on: 2022-11-10    Degree: Master    Type: Thesis
Country: China    Candidate: P Zhao    Full Text: PDF
GTID: 2518306608490084    Subject: Automation Technology
Abstract/Summary:
Named entity recognition (NER) is a fundamental task in natural language processing, and accurately identifying named entities is of great importance. English NER has made considerable progress, but Chinese is more difficult: its grammatical structure and semantics are complex and there are no explicit boundaries between words, so entity recognition is harder. This thesis therefore takes Chinese named entity recognition as its subject and studies it in depth.

According to the embedding vectors used by the model, Chinese NER methods can be divided into character-level and word-level embedding approaches. Word-level embedding depends heavily on word segmentation and suffers from serious segmentation errors. Character-level embedding avoids segmentation errors, but it feeds only character vectors into the model and lacks word-boundary and word-meaning information, so its recognition performance still needs improvement. Moreover, when the text contains out-of-vocabulary words and irrelevant words, the model is trained insufficiently and ambiguity often causes entity recognition errors, which degrades the performance of the recognition algorithm. Finally, the corpus contains fuzzy words, that is, a single entity may correspond to multiple entity types or may contain other entities, so entity boundaries and types are difficult to determine. To address these problems, the main work of this thesis is as follows.

(1) An Attention Adaptive Model with Word Information (AAMWI) for named entity recognition is proposed. The model fuses the character embedding vectors and the word-information embedding vectors of a sentence as its input, achieving feature enhancement by adding word-level information. In the encoding layer, an Adaptive Distribution Selection (ADS) attention mechanism is designed: a dynamic scaling factor adjusts the attention distribution over related entities and irrelevant words adaptively according to the hidden-layer output. This reduces, to some extent, the interference of irrelevant words and improves named entity recognition performance.

(2) A new word discovery method based on mutual information and adjacency entropy is proposed to build a domain dictionary. On this basis, word-information marker vectors are generated to capture the position of each character within related words, and the word-information embedding matrix of the sentence is formed. The method extends candidate strings character by character over the original corpus and screens new words using mutual information, adjacency entropy, and a co-occurrence matrix, combining rule-based and statistical criteria. Compared with the N-gram algorithm, the recognized new words contain few repeated word strings, and satisfactory results are achieved; a rough code sketch of this screening step is given below.
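The following Python fragment is a minimal, illustrative sketch of this kind of mutual-information and adjacency-entropy screening, not the exact procedure used in the thesis; the candidate length limit and the thresholds (min_count, min_pmi, min_entropy) are assumptions chosen for readability.

import math
from collections import Counter, defaultdict

def discover_new_words(corpus, max_len=4, min_count=5, min_pmi=3.0, min_entropy=1.0):
    # Screen character n-grams as new-word candidates: mutual information
    # measures internal cohesion, adjacency entropy measures how freely the
    # left/right context varies. All thresholds here are illustrative.
    text = "".join(corpus)
    total = len(text)

    # Count all character n-grams up to max_len.
    ngrams = Counter()
    for n in range(1, max_len + 1):
        for i in range(total - n + 1):
            ngrams[text[i:i + n]] += 1

    def prob(s):
        return ngrams[s] / total

    def cohesion(word):
        # Worst-case point-wise mutual information over all binary splits.
        return min(math.log(prob(word) / (prob(word[:i]) * prob(word[i:])))
                   for i in range(1, len(word)))

    # Collect the characters adjacent to each frequent candidate.
    left, right = defaultdict(Counter), defaultdict(Counter)
    for n in range(2, max_len + 1):
        for i in range(total - n + 1):
            w = text[i:i + n]
            if ngrams[w] < min_count:
                continue
            if i > 0:
                left[w][text[i - 1]] += 1
            if i + n < total:
                right[w][text[i + n]] += 1

    def entropy(counter):
        s = sum(counter.values())
        return -sum(c / s * math.log(c / s) for c in counter.values()) if s else 0.0

    new_words = []
    for w in left:
        if (cohesion(w) >= min_pmi
                and min(entropy(left[w]), entropy(right[w])) >= min_entropy):
            new_words.append(w)
    return new_words

In the thesis, strings that survive this kind of screening, together with the co-occurrence-matrix filter, are used to build the domain dictionary from which the word-information embedding matrix is generated.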
(3) An Enhanced Feature Embedding with Prior Knowledge (EFePK) model for named entity recognition is proposed. The model fuses the prior-knowledge embedding vectors and the shallow embedding vectors into an enhanced feature embedding, achieving feature enhancement through a word-level joint distribution that reflects entity features; this avoids word segmentation errors and, to a certain extent, alleviates the impact of out-of-vocabulary words on the model. In addition, a tag-merging CRF decoding layer is designed: the tags of fuzzy words are merged and the sequence labeling task is completed by the CRF, which gives a clear advantage in recognizing fuzzy words.

(4) The prior-knowledge embedding matrix of a sentence is designed. First, a word frequency vector and an ordered co-occurrence matrix are obtained by traversing the corpus related to the input sentence. On this basis, the latent entity tag vector and the ordered mutual information of the input sentence are constructed, and the prior-knowledge embedding matrix of the sentence is obtained. The matrix carries prior knowledge from the relevant corpus: it reflects the distribution probabilities, grammatical structures, and context information of characters within potential entities, strengthens the entity-boundary features of characters, makes it easier for the model to capture word boundaries during training, improves the generalization ability of the model to a certain extent, and yields better recognition on corpora with many out-of-vocabulary words and irregular text. A toy sketch of this construction is given after the experimental summary below.

Experiments were conducted on the economic and news datasets Resume and MSRA, as well as on the Weibo and Novel datasets, which contain irregular text and many out-of-vocabulary words. Compared with other models, the entity recognition accuracy of the AAMWI and EFePK models is greatly improved. In particular, the AAMWI model reduces the interference of irrelevant words through the ADS attention mechanism, better extracts the fused character and word features, and obtains satisfactory results on the economic and news datasets.
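As a purely illustrative sketch of how per-character prior-knowledge features of the kind described in (4) might be assembled, the following Python fragment builds, for each character of a sentence, a row containing its corpus frequency and its co-occurrence frequencies with its sentence neighbours. The window size and the omission of the latent entity tag vector and ordered mutual information are simplifications, and the function and parameter names are hypothetical, not the thesis's own.

import numpy as np
from collections import Counter, defaultdict

def prior_knowledge_rows(sentence, corpus, window=1):
    # Toy stand-in for a prior-knowledge embedding matrix: row i holds the
    # relative corpus frequency of character i and its co-occurrence
    # frequencies with the left/right neighbours in the sentence.
    text = "".join(corpus)
    total = len(text)

    char_freq = Counter(text)
    cooc = defaultdict(Counter)          # co-occurrence counts within a small window
    for i, c in enumerate(text):
        for j in range(max(0, i - window), min(total, i + window + 1)):
            if j != i:
                cooc[c][text[j]] += 1

    rows = []
    for i, c in enumerate(sentence):
        freq = char_freq[c] / total
        left = sentence[i - 1] if i > 0 else None
        right = sentence[i + 1] if i + 1 < len(sentence) else None
        rows.append([freq,
                     cooc[c][left] / total if left else 0.0,
                     cooc[c][right] / total if right else 0.0])
    return np.array(rows)                # shape: (len(sentence), 3)

Each row would then be concatenated with the character's shallow embedding to form the enhanced feature embedding described in (3).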
Keywords/Search Tags: Attention adaptation, Word-level information, Chinese named entity recognition, New word discovery, Out-of-vocabulary words, Feature enhancement, Prior knowledge