Research On Chinese Spelling Check With Data And Knowledge Enhancement

Posted on: 2024-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: Q Lv
Full Text: PDF
GTID: 2568306941463874
Subject: Computer technology
Abstract/Summary:
Chinese spelling check is an important task in natural language processing. Its main purpose is to automatically identify and correct misspelled words in user input. Because of the complexity and ambiguity of the Chinese language, Chinese spelling check is more challenging than English spelling check. Good performance on the task usually requires a large amount of corpus and training data, but such data is difficult to collect and update, so training data is often insufficient and current Chinese spelling check models struggle to adapt to practical applications. This thesis therefore builds a strong baseline model for the task and, focusing on the characteristics of the task, studies data augmentation and knowledge enhancement. The specific content is as follows.

Firstly, based on the inherent properties of Chinese characters, this thesis proposes a feature-enhanced Chinese spelling check model based on the sound and shape of characters. In addition to modeling sentence semantics, it uses separate pinyin and character shape encoders to independently model the sound and shape of each Chinese character in the sentence, yielding a strong baseline model.

Secondly, to address the lack of training data in the Chinese spelling check task, this thesis proposes a new pretraining task for data augmentation, the error consistency pretraining task. In addition, because current datasets lack the continuous errors that occur in real-world scenarios, it supplements the single-character confusion set from previous work with a continuous character confusion set.

Finally, this thesis introduces user dictionaries as external knowledge and proposes UD, a knowledge-enhanced Chinese spelling check framework based on the user dictionary. This framework can be applied to any spelling check system based on token classification models, and it automatically adapts to different correction scenarios in different fields without requiring additional training data.
Keywords/Search Tags: Chinese Spelling Check, Pretrained Language Model, Data Augmentation, User Dictionary
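
The abstract mentions supplementing a single-character confusion set so that continuous errors are also covered. The thesis's actual construction is not described here, so the following is only a minimal sketch of one way such training pairs might be generated: several consecutive characters are replaced using a single-character confusion set. The confusion entries, the function names (`corrupt_span`, `make_pair`), and the `span_len` parameter are all hypothetical and chosen for illustration.

```python
import random

# Hypothetical single-character confusion set: each character maps to
# characters that sound or look similar. Entries are illustrative only,
# not taken from the thesis.
SINGLE_CONFUSION = {
    "天": ["添", "田"],
    "气": ["汽", "器"],
    "很": ["狠", "恨"],
    "好": ["号", "郝"],
}


def corrupt_span(sentence, start, length, confusion, rng):
    """Replace `length` consecutive characters, beginning at `start`,
    with randomly chosen confusable characters."""
    chars = list(sentence)
    for i in range(start, min(start + length, len(chars))):
        candidates = confusion.get(chars[i])
        if candidates:
            chars[i] = rng.choice(candidates)
    return "".join(chars)


def make_pair(sentence, confusion, span_len=2, seed=0):
    """Return a (noisy, clean) pair containing one continuous error span.

    If no span of `span_len` confusable characters exists, the sentence
    is returned unchanged as both members of the pair.
    """
    rng = random.Random(seed)
    # Keep only start positions where every character in the span has
    # at least one confusable replacement.
    starts = [
        i
        for i in range(len(sentence) - span_len + 1)
        if all(sentence[j] in confusion for j in range(i, i + span_len))
    ]
    if not starts:
        return sentence, sentence
    start = rng.choice(starts)
    return corrupt_span(sentence, start, span_len, confusion, rng), sentence


if __name__ == "__main__":
    noisy, clean = make_pair("今天天气很好", SINGLE_CONFUSION)
    print(noisy, "->", clean)
```

Because the noisy and clean sentences have the same length, a pair produced this way fits the token classification setting the abstract refers to, where each input character is mapped to one output character.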