
Application Research Of Bi-LSTM-CRF Model In Chinese Grammar Error Diagnosis

Posted on: 2020-10-25
Degree: Master
Type: Thesis
Country: China
Candidate: S Liu
GTID: 2415330578452713
Subject: Computer technology
Abstract/Summary:
As China's international status rises, learning Chinese has become increasingly important for international learners. The goal of the Chinese Grammatical Error Diagnosis (CGED) task discussed in this paper is to develop a computer-assisted tool. Such a tool can help foreign learners who use Chinese as a second language to learn it better, and can also relieve the pressure on teachers of Chinese. The aim of CGED research is to build a model that automatically detects the errors learners make when writing Chinese, together with the locations of those errors. In this study, errors are divided into four categories: redundant words, missing words, bad word selection, and word-order errors.

The difficulty of CGED research is that the task involves several levels of information in natural language processing, including lexical analysis and syntactic analysis of Chinese, so all of these aspects have to be considered when making a judgment. In addition, Chinese carries a wealth of linguistic knowledge and its grammatical expressions are diverse. Judging whether a sentence contains errors, and of which types, therefore often requires external knowledge. In view of this, this paper proposes to use pyltp for data preprocessing. The personalized word segmentation feature of pyltp suits this task well, because CGED datasets mostly come from essays written in Chinese by different foreign students and cover many different topics. Personalized word segmentation can alleviate topic dependence to a certain extent. When facing a new topic, the user only needs to label a small amount of data, and the personalized segmenter is then trained incrementally on top of the original data. In this way it both exploits the information in the original topic data and takes the particularities of the target topic into account.

In addition, this paper proposes to model the task with a Bidirectional Long Short-Term Memory network (Bi-LSTM), which makes better use of context in both directions to determine whether a sentence is wrong. On this basis, we treat CGED as a special sequence labeling task. For sequence labeling, the Conditional Random Field (CRF) model performs better than the traditional Hidden Markov Model (HMM) and Maximum Entropy Markov Model (MEMM), while the Bi-LSTM model can alleviate two shortcomings of the CRF model: its reliance on manual feature selection and its difficulty in capturing long-distance context dependencies. This paper therefore further proposes to combine Bi-LSTM with the CRF model. Bi-LSTM is used to obtain long-distance information in both directions and passes that information to the CRF model, which performs the sequence labeling. Experimental results on the task's public standard evaluation dataset show that the Bi-LSTM-CRF model proposed in this paper is more effective than either the Bi-LSTM model or the CRF model alone on the CGED task.
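To make the division of labor between the two components concrete, the following is a minimal sketch of the decoding step that a CRF layer performs on top of per-token scores. It is not the thesis's implementation: the tag set (`O`, `B-R`, `I-R`, a BIO encoding of "redundant word" errors), the emission scores (which a trained Bi-LSTM would produce), and the transition scores (which a trained CRF would learn) are all hypothetical values chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical tag set: "O" marks a correct token; "B-R"/"I-R" mark the
# beginning/inside of a redundant-word error span (BIO encoding assumed).
TAGS = ["O", "B-R", "I-R"]

# Made-up per-token scores, standing in for Bi-LSTM outputs.
# EMISSIONS[t][j] = score of tag j at token position t.
EMISSIONS = [
    [2.0, 0.5, 0.0],
    [1.0, 0.75, 1.5],
    [0.25, 0.0, 1.5],
    [2.0, 0.25, 0.25],
]

# Made-up transition scores, standing in for learned CRF parameters.
# TRANSITIONS[i][j] = score of moving from tag i to tag j.
# "O" -> "I-R" is heavily penalized because a span cannot start with "I-".
TRANSITIONS = [
    [0.5, 0.0, -10.0],  # from O
    [0.0, -1.0, 1.0],   # from B-R
    [0.0, -1.0, 0.5],   # from I-R
]


def viterbi_decode(
    emissions: List[List[float]], transitions: List[List[float]]
) -> Tuple[List[int], float]:
    """Return the highest-scoring tag index sequence and its total score."""
    n_tags = len(transitions)
    # score[j] = best score of any path that ends in tag j at the current token
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this position.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[best_last]


# Per-token argmax ignores transitions and can yield an invalid sequence.
greedy = [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in EMISSIONS]
best_path, best_score = viterbi_decode(EMISSIONS, TRANSITIONS)
print("greedy :", greedy)
print("viterbi:", [TAGS[i] for i in best_path], best_score)
```

With these illustrative numbers, choosing the best tag for each token independently produces an `I-R` directly after `O`, which is not a well-formed span, whereas decoding jointly with the transition scores produces a span that opens with `B-R`. This is the kind of sequence-level consistency the CRF layer contributes on top of the contextual scores supplied by the Bi-LSTM.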
Keywords/Search Tags: Chinese Grammatical Error Diagnosis, Bidirectional Long Short-Term Memory Network, Conditional Random Field, Sequence Labeling