Research And Implementation Of Chinese Text Proofreading Algorithms Based On Deep Learning

Posted on:2023-03-21

Degree:Master

Type:Thesis

Country:China

Candidate:Z Liu

GTID:2558307073983049

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the information age, much text work has gradually moved to computers, and the number of electronic texts has snowballed. While the large amount of textual information on the Internet has enriched our lives, the explosive growth of data has inevitably led to a significant decline in the quality of textual data and in the efficiency of our access to information. Traditional manual proofreading cannot cope with such a large amount of data, and there is an urgent need for computer-aided Chinese text proofreading methods to assist or even replace manual proofreading. After in-depth research on text proofreading at home and abroad, this thesis makes three contributions:

1. An end-to-end Chinese spelling check model, BFMBERT (BiGRU-Fusion Mask BERT), that incorporates multi-feature embeddings of Chinese characters is proposed. The model first uses a pre-training task combined with confusion sets so that BERT learns knowledge of Chinese spelling errors. It then employs a bidirectional GRU network to estimate the error probability of each character in the text. Next, it uses this probability to compute a fusion embedding that incorporates the semantic, pinyin, and glyph features of Chinese characters. Finally, it feeds this fusion embedding into the masked language model in BERT to predict the correct characters. BFMBERT is evaluated on the SIGHAN 2015 benchmark dataset and achieves an F1 value of 82.2, outperforming the other baseline models.

2. A Chinese grammatical error correction model, CGECSE, based on sequence-to-edits is proposed. Multiple character-level edit labels are defined, together with a sequence transformation method that explicitly represents the editing process from an incorrect sentence to a correct one. After the Transformer-based encoder, CGECSE predicts the edit label of each character in the sentence with an edit label prediction layer, predicts the error probability of each character with an error probability prediction layer, and corrects the grammatical errors of the sentence by applying the edits, filtered by error confidence. The model uses sequence editing in place of a sequence-to-sequence model for Chinese grammatical error correction, which avoids the slow inference speed of autoregressive models and improves interpretability. In addition, source-end dropout and multi-granularity noise data augmentation methods are proposed to alleviate the small scale of Chinese grammatical error correction data and the resulting model overfitting. Experiments show that the performance of CGECSE meets expectations, outperforming other models on the NLPCC 2018 benchmark test set.

3. A Chinese text proofreading system is designed and implemented using a decoupled, multi-tier development approach. The proofreading service API is developed with Flask, the back-end system with Spring Boot, and the front-end interface with Vue.js. The resulting low-coupling system provides online and offline proofreading functions and verifies the usability of the proposed Chinese spelling check model and Chinese grammatical error correction model.
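The editing step of a sequence-to-edits corrector can be illustrated with a minimal Python sketch. It is not the CGECSE implementation: the abstract does not list the actual edit labels, so the label scheme here (`KEEP`, `DELETE`, `REPLACE_x`, `APPEND_x`), the function name `apply_edits`, and the default threshold of 0.5 are all assumptions made for illustration. The sketch takes one predicted edit label and one error probability per character, and applies an edit only when the error probability reaches the threshold.

```python
from typing import List

# Hypothetical label scheme, assumed for illustration only;
# the labels actually defined in the thesis are not given in the abstract.
KEEP = "KEEP"
DELETE = "DELETE"
REPLACE_PREFIX = "REPLACE_"
APPEND_PREFIX = "APPEND_"


def apply_edits(chars: str,
                labels: List[str],
                error_probs: List[float],
                threshold: float = 0.5) -> str:
    """Apply one character-level edit label per input character.

    An edit is applied only if the character's error probability is at
    least `threshold`; otherwise the character is kept unchanged.
    """
    if not (len(chars) == len(labels) == len(error_probs)):
        raise ValueError("chars, labels and error_probs must have equal length")

    out: List[str] = []
    for ch, label, prob in zip(chars, labels, error_probs):
        if label == KEEP or prob < threshold:
            # Low-confidence edits are filtered out and treated as KEEP.
            out.append(ch)
        elif label == DELETE:
            # Drop a redundant character.
            continue
        elif label.startswith(REPLACE_PREFIX):
            # Substitute the character with the one named in the label.
            out.append(label[len(REPLACE_PREFIX):])
        elif label.startswith(APPEND_PREFIX):
            # Keep the character and insert a missing one after it.
            out.append(ch)
            out.append(label[len(APPEND_PREFIX):])
        else:
            raise ValueError(f"unknown edit label: {label}")
    return "".join(out)


if __name__ == "__main__":
    source = "我今天非常很高心"
    labels = [KEEP, KEEP, KEEP, KEEP, KEEP, DELETE, KEEP, "REPLACE_兴"]
    probs = [0.01, 0.02, 0.01, 0.03, 0.05, 0.92, 0.04, 0.88]
    print(apply_edits(source, labels, probs))
    print(apply_edits(source, labels, probs, threshold=0.9))
```

Because every character is labeled in a single pass, the output is produced without step-by-step decoding, which is the source of the speed advantage over autoregressive models that the abstract describes. Raising the threshold makes the corrector more conservative: in the second call above, the replacement with probability 0.88 is no longer applied.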

Keywords/Search Tags:

Chinese text proofreading, Deep learning, Pre-trained model, Masked language model, Sequence-to-edits

Related items

1	Chinese Text Automatic Proofreading System
2	Research On Chinese Text Summarization Based On Deep Learning
3	Research On Sentiment Analysis Of Chinese Barrage Text Based On Deep Learning
4	Research And Application Of Related Techniques For Text Summarization Based On Deep Learning
5	Research On Deep Learning Based Chinese Scene Text Detection And Recognition
6	Research On Computer Virus Signature Automatic Extraction Technique
7	Research On Automatic Text Summarization Generation Technology Based On Deep Learning
8	Improvement And Research Of Sequence-to-Sequence Model For Chinese Text Summarization
9	Research On Chinese And English Text Entity Recognition Technology Based On Pre Training Language Model
10	Study On Chinese Word Segmentation Based On Recurrent Neural Network Language Model