
Research On Chinese-Vietnamese Entity Alignment Technology Based On Named Entity Recognition

Posted on: 2021-02-17  Degree: Master  Type: Thesis
Country: China  Candidate: L Tian  Full Text: PDF
GTID: 2415330647461962  Subject: Engineering
Abstract/Summary:
Aligned bilingual corpora play an important role in machine translation, word sense disambiguation, and bilingual dictionary compilation. A corpus can be aligned at different levels, such as chapters, paragraphs, sentences, phrases, and words; the finer the granularity, the more detailed the information the alignment provides. However, differences between languages complicate text preprocessing, which in turn makes automatic corpus alignment harder. At the entity level, there has been no relevant research on bilingual alignment between Chinese and Vietnamese. To achieve Chinese-Vietnamese entity alignment and further expand the bilingual corpus, this thesis concentrates on Vietnamese named entity recognition and proposes a Chinese-Vietnamese bilingual entity alignment method based on named entity recognition. The main work of the thesis is as follows:
(1) To address the lack of Vietnamese corpora and the difficulty of building them by hand, we construct a Vietnamese entity recognition dataset. The dataset is built with little manual intervention, which eases the scarcity of annotated corpora for Vietnamese entity recognition tasks.
(2) We build and implement a Vietnamese named entity recognition model based on BERT-GRU-CRF. The model produces word vectors through a BERT (Bidirectional Encoder Representations from Transformers) layer, uses a GRU (Gated Recurrent Unit) layer to extract semantic features from those vectors, and predicts entity labels with a CRF (Conditional Random Field). In comparison with commonly used related models, the F1 values for person, place, and institution names reach 92.98%, 95.86%, and 88.77% respectively, and the model performs better on Vietnamese data.
(3) Because there is little research on entity alignment between Chinese and Vietnamese, we provide a Chinese-Vietnamese bilingual named entity alignment scheme based on named entity recognition. The scheme consists of word alignment, named entity recognition, and entity alignment; it integrates the word alignment result with the entity recognition result to obtain aligned Chinese-Vietnamese bilingual entities. In experiments on person, place, and organization names, the F1 values reach 74.98%, 78.20%, and 65.76% respectively, which shows that the scheme can extract aligned Chinese-Vietnamese bilingual entities effectively.
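The integration step described in (3), combining word alignment links with recognized entity spans, can be illustrated with a minimal Python sketch. The abstract does not specify how the two results are merged, so the overlap-based matching rule below is an assumption, not the thesis's actual algorithm. The function name align_entities, the span format (start, end_exclusive, type), and the example sentence, links, and spans are all hypothetical and written by hand for illustration.

```python
from typing import List, Set, Tuple

# An entity span: (start index, end index exclusive, entity type).
Span = Tuple[int, int, str]


def align_entities(
    zh_spans: List[Span],
    vi_spans: List[Span],
    word_links: Set[Tuple[int, int]],
) -> List[Tuple[Span, Span]]:
    """Pair each Chinese entity with the same-type Vietnamese entity
    that covers the most word-alignment links (assumed heuristic)."""
    pairs = []
    for zs in zh_spans:
        z_start, z_end, z_type = zs
        # Vietnamese token indices linked to any token inside the Chinese entity.
        linked = {j for (i, j) in word_links if z_start <= i < z_end}
        best, best_overlap = None, 0
        for vs in vi_spans:
            v_start, v_end, v_type = vs
            if v_type != z_type:
                continue  # only pair entities of the same type
            overlap = sum(1 for j in linked if v_start <= j < v_end)
            if overlap > best_overlap:
                best, best_overlap = vs, overlap
        if best is not None:
            pairs.append((zs, best))
    return pairs


# Hand-made example data (hypothetical, not from the thesis).
zh_tokens = ["胡志明", "访问", "了", "河内"]
vi_tokens = ["Hồ", "Chí", "Minh", "đã", "thăm", "Hà", "Nội"]

# Word alignment links as (Chinese index, Vietnamese index).
word_links = {(0, 0), (0, 1), (0, 2), (1, 4), (2, 3), (3, 5), (3, 6)}

# Entity spans as they might come from a recognizer on each side.
zh_spans = [(0, 1, "PER"), (3, 4, "LOC")]
vi_spans = [(0, 3, "PER"), (5, 7, "LOC")]

aligned = align_entities(zh_spans, vi_spans, word_links)

# Surface strings for each aligned pair, with the shared entity type.
surface = [
    ("".join(zh_tokens[zs[0]:zs[1]]), " ".join(vi_tokens[vs[0]:vs[1]]), zs[2])
    for zs, vs in aligned
]
for zh_text, vi_text, ent_type in surface:
    print(ent_type, zh_text, "<->", vi_text)
```

Requiring matching entity types and at least one shared alignment link keeps the sketch conservative: an entity with no linked counterpart is simply left unaligned. A real system would also need to handle noisy links and recognition errors on either side.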
Keywords/Search Tags:Named Entity Recognition, Dataset Construction, Bilingual Entity Alignment, Word Alignment