Grammatical error correction (GEC) is a classical natural language processing (NLP) task that aims to automatically correct grammatical and related errors in text. The importance of GEC arises from the growing number of second-language learners worldwide, including children and students, and it also helps native speakers correct large volumes of text. In addition, some NLP tasks, such as part-of-speech tagging and text summarization, rely on GEC to check and correct their input text. Classical GEC approaches, such as rule-based systems and classifiers, were developed to correct specific error types and are therefore limited when sentences contain complex or multiple errors. Recently, neural machine translation (NMT) has proven to be a powerful and well-established approach to GEC. The main challenge of NMT for GEC is that it requires large parallel training corpora, which are not available for low-resource languages such as Arabic. Arabic GEC (AGEC) is still developing because of challenges such as the scarcity of learner data, the complexity of Arabic grammar, and the richness of Arabic morphology. This work seeks to overcome the limitations of previous work, and its contributions are summarized as follows:

(1) To overcome the limitation of previous AGEC models based on recurrent neural networks (RNNs), which focus mainly on nearby words, a GEC model based on a convolutional sequence-to-sequence architecture with nine encoder-decoder layers and an attention mechanism is proposed. Convolutional neural networks (CNNs) give the proposed GEC model the ability to combine feature extraction and classification in a single task. CNN-based GEC has proven effective at capturing local context features, and stacking convolutional layers allows it to detect long-range dependencies. In addition, a semi-supervised method based on a confusion function is proposed to generate synthetic training data and enlarge the training set.

(2) Previous approaches generated synthetic training data from a different data distribution, and the resulting mismatch poses an additional challenge for low-resource GEC. To address this, seven data augmentation (DA) approaches are proposed that target the two main challenges in low-resource GEC: data sparsity and mismatched data distribution. The augmented data in this work perturbs the target side rather than the source side; it generates new contexts during training and makes the target prefix less informative for predicting the next word. In this way the encoder is strengthened, and the decoder is forced to pay more attention to the encoder's source representations when generating a new word, as sketched below. The impact of the proposed DA approaches was investigated with a multi-head attention network (Transformer) for correcting grammatical errors in Arabic. Experimental results on the QALB-2014 and QALB-2015 benchmarks show that the proposed approaches outperform the classical misspelling DA method and the Arabic GEC baseline.
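The seven augmentation operations are defined in the body of this work; the following is only a minimal, hypothetical sketch of one kind of target-side perturbation (random token dropout and replacement on the decoder input), assuming whitespace-tokenized target sentences and an illustrative vocabulary. The function name, probabilities, and example data are not taken from the original work.

```python
import random

def noise_target_prefix(target_tokens, vocab, p_drop=0.1, p_replace=0.1, seed=None):
    """Perturb the gold target sequence used as decoder input (teacher forcing).

    Making the target prefix less informative forces the decoder to rely more
    on the encoder's source representations; the loss is still computed against
    the unperturbed gold target. Illustrative sketch only.
    """
    rng = random.Random(seed)
    noisy = []
    for tok in target_tokens:
        r = rng.random()
        if r < p_drop:
            continue                          # drop the token from the prefix
        elif r < p_drop + p_replace:
            noisy.append(rng.choice(vocab))   # replace with a random vocabulary word
        else:
            noisy.append(tok)                 # keep the token unchanged
    return noisy

# Toy usage with a hypothetical Arabic sentence and vocabulary.
sentence = "ذهب الطالب إلى المدرسة صباحا".split()
vocab = ["الكتاب", "المعلم", "البيت", "اليوم"]
print(noise_target_prefix(sentence, vocab, seed=3))
```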
(3) A key shortcoming of seq2seq GEC models with multiple encoder-decoder layers is that only the top layer is exploited in subsequent processing. Furthermore, because of the exposure-bias problem at inference time, some previously generated target words are dropped and replaced by words produced by the model itself, which leads to unsatisfactory output. To this end, a Transformer-based GEC model for low-resource languages is proposed to address these seq2seq GEC issues. Motivated by the success of capsule networks in computer vision, the Expectation-Maximization routing algorithm is used to dynamically aggregate information across layers for Arabic GEC. In addition, to mitigate the exposure-bias problem, a bidirectional regularization term based on the Kullback-Leibler divergence is added to the training objective to improve the agreement between right-to-left and left-to-right models. Moreover, a noising method is proposed to construct synthetic parallel data and relieve the bottleneck caused by the lack of corpora. Experiments on the QALB-2014 and QALB-2015 benchmarks show that the proposed model achieves the best F1 score compared with existing Arabic GEC systems.
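To make the agreement objective concrete, one common way to write such a symmetric regularizer is sketched below; the exact conditioning, the weight λ, and the direction of the divergences are assumptions for illustration and may differ from the formulation used in this work:

\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{L2R}} + \mathcal{L}_{\mathrm{R2L}}
\;+\; \lambda \sum_{t=1}^{T}\Big(
D_{\mathrm{KL}}\big(P_{\mathrm{L2R}}(y_t \mid y_{<t}, x)\,\big\|\,P_{\mathrm{R2L}}(y_t \mid y_{>t}, x)\big)
+ D_{\mathrm{KL}}\big(P_{\mathrm{R2L}}(y_t \mid y_{>t}, x)\,\big\|\,P_{\mathrm{L2R}}(y_t \mid y_{<t}, x)\big)
\Big)
\]

where \(\mathcal{L}_{\mathrm{L2R}}\) and \(\mathcal{L}_{\mathrm{R2L}}\) are the cross-entropy losses of the left-to-right and right-to-left decoders, \(P_{\mathrm{L2R}}\) and \(P_{\mathrm{R2L}}\) are their per-position output distributions over the vocabulary given the source sentence \(x\), and \(\lambda\) balances the agreement term against the translation losses.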