| In recent years, with the growth of China’s overall national strength and the rise of its international status, more and more foreigners have begun to learn Chinese. This benefits exchange and cooperation between China and other countries. However, second language learners of Chinese make various grammatical errors, owing to negative transfer from their mother tongue and other causes. Using standard grammar is an important way for second language learners to improve their writing. Automatic diagnosis of grammatical errors in Chinese as a foreign language can reveal common error patterns caused by negative transfer from the mother tongue and thereby guide the teaching of Chinese as a foreign language. Research on automatically diagnosing the grammatical errors of second language learners therefore has practical significance. At present, most solutions for the automatic diagnosis of Chinese grammatical errors use BiLSTM-CRF as the base model, optimized with feature engineering, random seed tuning, model ensembling, hand-written rules, and other methods. However, evaluation reports and academic papers from recent years show that the diagnosis results of the base model are not ideal. The scarcity of annotated corpora of Chinese grammatical errors and the high false positive rate of diagnostic models are problems that urgently need to be solved. To address these two problems, the main contributions of this paper are as follows. (1) Building on existing data augmentation techniques, a preliminary attempt is made to construct a data set in the field of grammatical errors, which addresses the scarcity of annotated corpora of Chinese grammatical errors. Taking Easy Data Augmentation (EDA) as the core idea, and summarizing the distribution rules and structural characteristics of the various error types, we construct a synthetic
grammatical error data set, SGB (Synthetic Grammatical error data Base), to expand the training corpus. The experimental results show that adding the synthetic data set to the training of the diagnostic model improves the F1 score for error position identification by nearly 8%. (2) A pipeline-based model for the automatic diagnosis of grammatical errors, TSM (Text classification & Sequence labeling & Mask language model), is proposed, which addresses the high false positive rate of traditional models. The method decomposes the automatic diagnosis task into three sub-tasks: binary text classification of whether a sentence contains grammatical errors, sequence labeling of error positions and error types, and text error correction. First, a text classification model based on fine-tuned BERT decides whether the sentence contains errors; then a BiLSTM-CRF model fused with RoBERTa predicts the types and positions of the errors in the sentence; finally, a masked language model recommends corrections for sentences containing two kinds of errors: word selection errors and missing-word errors. |
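
As a rough illustration of the EDA-style data synthesis described in contribution (1), the sketch below injects one labeled error into a correct, pre-segmented sentence. EDA's four basic operations are synonym replacement, random insertion, random swap, and random deletion. Mapping them to the error labels `selection`, `redundant`, `word_order`, and `missing`, as well as the function names and the tiny `confusion` dictionary, are assumptions made for this example and are not taken from the SGB construction procedure, which additionally relies on the observed distribution of real errors.

```python
import random

# All names below are hypothetical; this is not the SGB implementation.

def make_redundant(tokens, rng):
    """Duplicate one token (analogue of EDA random insertion)."""
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:], ("redundant", i + 1)

def make_missing(tokens, rng):
    """Delete one token (analogue of EDA random deletion)."""
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:], ("missing", i)

def make_word_order(tokens, rng):
    """Swap two adjacent tokens (analogue of EDA random swap)."""
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out, ("word_order", i)

def make_selection(tokens, rng, confusion):
    """Replace a token with an easily confused word (analogue of synonym replacement)."""
    candidates = [i for i, t in enumerate(tokens) if t in confusion]
    if not candidates:
        return list(tokens), None  # nothing replaceable: sentence stays correct
    i = rng.choice(candidates)
    out = list(tokens)
    out[i] = rng.choice(confusion[tokens[i]])
    return out, ("selection", i)

def corrupt(tokens, rng, confusion):
    """Apply one randomly chosen operation; returns (new_tokens, label or None)."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    ops = [make_redundant, make_missing, make_word_order,
           lambda t, r: make_selection(t, r, confusion)]
    return rng.choice(ops)(tokens, rng)

if __name__ == "__main__":
    rng = random.Random(0)
    sentence = ["我", "昨天", "去", "了", "学校"]
    print(corrupt(sentence, rng, {"去": ["走"]}))
```

Sampling the operation uniformly, as `corrupt` does here, is the simplest choice; a corpus meant to resemble learner writing would presumably weight the operations by how often each error type occurs in annotated data.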
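
The control flow of the three-stage TSM pipeline in contribution (2) can be sketched as follows. The three models are passed in as plain callables, so the example shows only how the stages are chained: the sentence-level classifier acts as a gate, which is one plausible reading of how the pipeline lowers false positives, and the masked language model is consulted only for selection and missing-word spans. The `Diagnosis` type, the `diagnose` function, the span format, and the label strings are all assumptions made for this sketch, not the paper's interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# (start, end, error_type), end exclusive; hypothetical span format.
Span = Tuple[int, int, str]

@dataclass
class Diagnosis:
    has_error: bool
    spans: List[Span] = field(default_factory=list)
    suggestions: Dict[Span, List[str]] = field(default_factory=dict)

def diagnose(
    sentence: str,
    classify: Callable[[str], bool],            # stage 1: does the sentence contain an error?
    tag: Callable[[str], List[Span]],           # stage 2: error positions and types
    suggest: Callable[[str, Span], List[str]],  # stage 3: masked-LM candidates
    correctable: Tuple[str, ...] = ("selection", "missing"),
) -> Diagnosis:
    # Stage 1 gates the rest: sentences judged correct are never tagged,
    # so the tagger cannot raise false alarms on them.
    if not classify(sentence):
        return Diagnosis(has_error=False)
    # Stage 2: locate and type the errors.
    spans = tag(sentence)
    # Stage 3: only the correctable error types get suggestions.
    suggestions = {s: suggest(sentence, s) for s in spans if s[2] in correctable}
    return Diagnosis(has_error=True, spans=spans, suggestions=suggestions)

if __name__ == "__main__":
    # Stub models standing in for BERT, BiLSTM-CRF + RoBERTa, and the masked LM.
    result = diagnose(
        "我昨天走了学校",
        classify=lambda s: True,
        tag=lambda s: [(3, 4, "selection")],
        suggest=lambda s, span: ["去"],
    )
    print(result)
```

Keeping the stages as injected callables makes it easy to swap in real models later or to evaluate each stage in isolation.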