
Research On Named Entity Recognition Based On Ancient Book Corpus

Posted on: 2021-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Gao
Full Text: PDF
GTID: 2505306104989609
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
With the rapid development of computer technology, we now live in an era of information explosion. How to find the text we need in a large volume of Internet information quickly, accurately and efficiently has become a pressing problem. Named entity recognition, an important research area of natural language processing, has received much attention in academia, and a large number of related algorithms have emerged in recent years. From both a theoretical and a practical perspective, research on named entity recognition therefore has important significance and value. Current research, however, focuses mainly on English and modern Chinese corpora. Owing to the characteristics of ancient texts, research on named entity recognition for ancient corpora remains scarce. As the digitization of ancient books continues to accelerate, the ability to extract the entity information we need would greatly benefit both ancient literature research and natural language processing.

The corpus of this paper is taken from the classic ancient book Hanshu. By mining the characteristics of Hanshu and applying the Transformer structure and the RoBERTa pre-trained model, which are among the strongest techniques for current sequence tasks, we automatically identify person names, place names, book names, time expressions and dynasties in Hanshu. Compared with existing named entity recognition models, performance is greatly improved. Finally, this paper compiles the four kinds of entities identified by the model in the Hanshu corpus (person names, place names, book names and time-quantity expressions).

The paper consists of five chapters.

Chapter 1 is an introduction. It explains the choice of topic from three aspects: the development of computer technology, the necessity of recognizing named entities in ancient books, and the value and significance of doing so. It also reviews the relevant theoretical background and the status of related research at home and abroad, then sets out the purpose and significance of the study and introduces the main research methods.

Chapter 2 focuses on the processing of the ancient Chinese corpus "Hanshu". It covers the process and method of acquiring and processing the "Hanshu" corpus, and explains the choice of entity categories and how they are labeled.

Chapter 3 is the focus of the paper. First, it gives a comprehensive analysis of three classic families of named entity recognition approaches: rule- and dictionary-based methods, statistical methods, and deep learning methods. It then evaluates the performance of word vectors and character vectors on the ancient book Hanshu, and improves on the BiLSTM + CRF structure, the most popular model for named entity recognition, by drawing on the Transformer structure and the open-source RoBERTa pre-trained model, which have performed well in machine translation, text classification and other tasks.

Chapter 4 presents and tests the models. It implements the classic named entity recognition models and the improved model in code, and tests their performance on the annotated "Hanshu" data set. The CRF, BiLSTM + CRF, Transformer and pre-trained RoBERTa models are compared. The best model is the one fine-tuned from pre-trained RoBERTa: its accuracy is 93.40%, which is 12.75% higher than that of the BiLSTM-CRF model based on word vectors, and its F1 value is 12.41% higher.

Chapter 5 summarizes the research work of this paper and looks ahead to follow-up work. Recognizing the named entities in the ancient book Hanshu has value in its own right, and the identified entities are convenient for ontology research. To some extent, this work fills a gap in current named entity recognition research on ancient book corpora. The algorithm model built in this paper has considerable application value and reference significance for research on named entity recognition based on ancient book corpora, and will play a role in computational linguistics and natural language processing.
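The abstract reports accuracy and F1 values but does not say how they are computed or which tagging scheme is used. As a rough illustration only, the sketch below assumes BIO tags and exact-match, entity-level scoring, which is a common convention for named entity recognition. The labels `PER` and `LOC` and the function names are hypothetical and are not taken from the thesis.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); a hypothetical representation
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel flushes a span that runs to the end of the sequence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An "I-" tag with no open span of the same type is ignored.
    return spans


def precision_recall_f1(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Entity-level scores: a predicted span counts only on exact match."""
    tp = n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = extract_spans(g), extract_spans(p)
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with made-up tags: one of two gold entities is recovered.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
p, r, f1 = precision_recall_f1(gold, pred)
```

In this toy case the single predicted entity is correct, so precision is 1.0, while only one of the two gold entities is found, so recall is 0.5. Exact-match scoring is stricter than per-character accuracy, which is one reason the two figures can differ on the same model.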
Keywords/Search Tags: Corpus of ancient books, "Hanshu", Named entity recognition, RoBERTa model