BERT-based Text Error Correction Model For Normative Documents

Posted on: 2022-03-05; Degree: Master; Type: Thesis
Country: China; Candidate: S Q Wang; Full Text: PDF
GTID: 2518306497452144; Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of computers and the Internet in China, Chinese text error correction technology has come into increasingly wide use. To ensure that administrative normative documents are standardized and rigorous, each document goes through conception, drafting, revision, and finalization before it is issued. Every one of these steps matters, and the repeated rounds of revision and correction amount to manual proofreading of the text. Careful control of revision and proofreading can effectively avoid unnecessary errors. Applying Chinese text error correction technology to administrative normative documents, to assist staff in correcting the text, is therefore of practical significance. The main research contents of this thesis are as follows:

1. After studying the characteristics of both Chinese text error correction tasks and administrative normative documents, this thesis constructs a BERT-based error correction model for normative documents. The model handles four types of text error (redundant characters, missing characters, out-of-order characters, and typos) and works in two stages: error detection and error correction. The error detection stage determines whether the text contains an error, where the error is, and what type it is. For this stage, a BERT-BiLSTM-CRF sequence tagging model tags each sentence, and the tags identify the location and type of the errors in it. Because the task involves domain-specific entities, the tagged sentences are then passed through an entity filter that discards entities misjudged as errors, which yields the final error labels for the sentence. The error correction stage is modeled separately for each of the four error types. Redundant errors are corrected by deleting the redundant content directly. Out-of-order errors are corrected by reversing the erroneous span. Missing errors are corrected by using the BERT masked language model to predict the missing content and filling it back into the original sentence. Typos are corrected by combining the BERT masked language model with confusion set matching: the best candidate word is selected by probability, and the sentence perplexity computed for the two results determines the final answer.

2. To address the task of correcting errors in the Chinese text of administrative normative documents, this thesis constructs a new data set focused on that domain. A set of normative documents provided by a certain city was selected and filtered, yielding nearly 10,000 administrative documents. For each of the four error types (redundant, missing, out-of-order, and typo), erroneous sentences are generated automatically and the erroneous parts are marked, which produces a large-scale, automatically generated data set for Chinese text error correction focused on administrative normative documents. After the BERT-based normative document error correction model was trained on this data set, the experimental results show that it outperforms the classic Chinese text error correction project Pycorrector, which indicates that the proposed error correction model is effective.
Keywords/Search Tags: Chinese text error correction, administrative normative documents, BERT, CRF
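
The per-type correction stage described in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the thesis's implementation: the tag names, the `Span` tuple, the `correct` function, and the `predict` callable (a stand-in for the BERT masked language model) are all hypothetical, the detection stage is assumed to have already produced the error spans, and the perplexity-based choice between the two typo candidates is left out.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical tag names for the four error types named in the abstract.
REDUNDANT, MISSING, DISORDER, TYPO = "R", "M", "W", "S"

# One detected error: (type, start, end), character indices, end exclusive.
# A MISSING span has start == end and marks the gap position.
Span = Tuple[str, int, int]

MASK = "[MASK]"

# Stand-in for a masked language model (an assumption, not a real API):
# receives the characters with one MASK slot plus a list of allowed
# candidates (empty list means unrestricted) and returns one character.
Predictor = Callable[[List[str], List[str]], str]


def correct(sentence: str, spans: Sequence[Span], predict: Predictor,
            confusion: Dict[str, List[str]]) -> str:
    """Apply one correction rule per detected error span."""
    chars = list(sentence)
    # Work right to left so that earlier indices stay valid after edits.
    for kind, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        if kind == REDUNDANT:
            # Redundant content is simply deleted.
            del chars[start:end]
        elif kind == DISORDER:
            # Out-of-order content is reversed in place.
            chars[start:end] = chars[start:end][::-1]
        elif kind == MISSING:
            # Insert a mask at the gap and fill it with the prediction.
            masked = chars[:start] + [MASK] + chars[start:]
            chars.insert(start, predict(masked, []))
        elif kind == TYPO:
            # Single-character typo assumed: mask it and restrict the
            # prediction to the confusion set of the observed character.
            masked = chars[:start] + [MASK] + chars[start + 1:]
            candidates = confusion.get(chars[start], [])
            chars[start] = predict(masked, candidates)
        else:
            raise ValueError(f"unknown error type: {kind}")
    return "".join(chars)
```

In practice `predict` would wrap a real masked language model and `confusion` would map each character to its visually or phonetically similar characters; here both are injected as parameters so that the dispatch logic can be exercised on its own.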