Research On Data Augmentation Methods For Chinese-Vietnamese Neural Machine Translation

Posted on:2023-12-13

Degree:Master

Type:Thesis

Country:China

Candidate:J Yang

Full Text:PDF

GTID:2555306797982469

Subject:Computer technology

Abstract/Summary:

The performance of neural machine translation systems depends to a large extent on large-scale, high-quality parallel corpora. Vietnamese is a resource-poor language that differs considerably from Chinese, so the quality of Chinese-Vietnamese machine translation is poor. Chinese-Vietnamese parallel corpora are scarce, and the current mainstream data augmentation methods generate low-quality, noisy data, translate long sentences poorly, and overfit easily during training. To address these problems, this thesis studies data augmentation methods for Chinese-Vietnamese neural machine translation and proposes several new methods and technical improvements. The specific research content is as follows:

(1) Chinese-Vietnamese Neural Machine Translation Data Augmentation Method Based on Language Similarity Features

Vietnamese is a typical resource-poor language, and large-scale, high-quality Chinese-Vietnamese bilingual parallel corpora are expensive to obtain. Pseudo-parallel data generated by back-translation in low-resource settings are noisy, and the translation model is sensitive to both the quality and the quantity of the pseudo-parallel data, so adding them directly to the training data may degrade the translation model. In response to this problem, this thesis proposes to filter the pseudo-parallel data using language similarity features at three levels: lexical features, phrase constituent features, and sentence semantic features. The higher-quality pseudo-parallel data that pass the filter are added to the training corpus as additional data to improve model performance. Translation experiments on the Chinese-Vietnamese dataset show that a filtered pseudo-parallel data scale of 140K is the most suitable amount to add for the model in this thesis. After the model is fully trained with iterative back-translation, it improves over the baseline model by 1.29 BLEU from Chinese to Vietnamese and by 1.52 BLEU from Vietnamese to Chinese.

(2) Chinese-Vietnamese Neural Machine Translation Data Augmentation Method Based on Syntactic Structure Features

The quality of pseudo-parallel data generated by back-translation depends on the underlying machine translation model. Because parallel corpora are scarce in low-resource settings, the pseudo-parallel data obtained this way are noisy, and the problem is more serious for long sentences than for short ones. To alleviate this, this thesis parses the long sentences in the training data into constituency trees, extracts structural features in the form of short sentences from the tree structure, and then translates these features with a statistical machine translation model. The aim is to learn finer-grained alignment knowledge at the level of words, short sentences, and sentences. The translated features are finally added to the training of the neural machine translation model as augmented data. Experiments show that the method can still slightly improve Chinese-Vietnamese translation performance in both directions, and that the translation of long sentences improves further over the baseline model.

(3) Enhancing Chinese-Vietnamese Low-resource Neural Machine Translation Based on Random Deactivation Strategy

In low-resource neural machine translation, the number of training samples is small and the number of parameters the model must learn is large, so overfitting occurs easily. The random deactivation (dropout) strategy is a common way to prevent it. In this method, the input sentence is also augmented based on its own representation: neuron nodes in the network are masked by the random deactivation strategy, which diversifies the vector representation of the input sentence. First, the sentence is fed into the model multiple times, with random deactivation applied on each pass, so that the predicted output distributions are similar but not identical. KL divergence is then used to constrain these distributions, and the KL divergence loss and the original cross-entropy loss update the model together to enrich the representation of the input samples. Experiments show that the method achieves the highest translation performance when trained on a mixture of the existing bilingual data and the generated pseudo-parallel data, reaching 23.87 BLEU in the Chinese-Vietnamese direction and 23.26 BLEU in the Vietnamese-Chinese direction.

(4) Construction of a Data Augmentation Prototype System for Chinese-Vietnamese Neural Machine Translation

This thesis designs a data augmentation prototype system for Chinese-Vietnamese low-resource neural machine translation. The system uses the existing small-scale Chinese-Vietnamese parallel data, expands the corpus with the data augmentation methods, and trains the model with the random deactivation strategy to provide "Chinese-Vietnamese" and "Vietnamese-Chinese" translation, giving users a real-time online Chinese-Vietnamese neural machine translation platform.
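
The training objective described in item (3) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy linear classifier, the function names, the dropout rate, the weight `alpha`, and the choice of a symmetric KL term are all assumptions made for illustration. The sketch feeds the same input through the model twice with independent dropout masks, then combines the averaged cross-entropy with a KL term that pulls the two output distributions together.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(vector, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def forward(features, weights, rate, rng):
    """One stochastic pass: dropout on the input, then a linear layer and softmax."""
    dropped = dropout(features, rate, rng)
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    return softmax(logits)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(features, weights, target, rate=0.3, alpha=1.0, rng=None):
    """Cross-entropy averaged over two dropout passes plus a weighted KL term.

    `alpha` and the symmetric form of the KL term are illustrative assumptions.
    Returns (total_loss, cross_entropy, kl).
    """
    rng = rng or random.Random(0)
    p1 = forward(features, weights, rate, rng)  # first pass, first dropout mask
    p2 = forward(features, weights, rate, rng)  # second pass, new dropout mask
    ce = -0.5 * (math.log(p1[target] + 1e-12) + math.log(p2[target] + 1e-12))
    kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return ce + alpha * kl, ce, kl


# Toy usage: 4 input features, 3 output classes, gold class 0.
features = [0.5, -1.0, 2.0, 0.1]
weights = [[0.2, -0.1, 0.4, 0.0], [0.1, 0.3, -0.2, 0.5], [-0.3, 0.2, 0.1, 0.1]]
total, ce, kl = consistency_loss(features, weights, target=0, rate=0.5, alpha=2.0)
```

With `rate=0.0` the two passes are identical and the KL term vanishes, so the loss reduces to ordinary cross-entropy. In a real translation model the same idea would be applied per target token over the vocabulary distribution.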

Keywords/Search Tags:

Chinese-Vietnamese Neural Machine Translation, Data Augmentation, Pseudo-parallel Data, Back Translation, Dropout Strategy

Related items

1	Research On Data Reduction Methods For Neural Machine Translation
2	Research On Neural Machine Translation Based English Grammatical Error Correction
3	Research On Chinese-to-English Machine Translation For Medical Field
4	Research On Chinese-Vietnamese Machine Translation Based On Neural Network
5	Research And Implementation Of Chinese-Vietnamese Neural Machine Translation Integrating Translation Knowledge
6	The E-C Translation Of The Big Data Agenda: Data Ethics And Critical Data Studies (Chapter 1-2) And A Report On The Translation
7	Machine Translation Quality Estimation Based On XLM-R
8	A Report On The Translation Of Data Science For Business-What You Need To Know About Data Mining And Data-Analytic Thinking(Chapter Fourteen And Appendixes A&B) By Foster Provost And Tom Fawcett
9	Coping with Data-sparsity in Example-based Machine Translation
10	A Report On The Translation Of Machine Learning And AI For Healthcare:Big Data For Improved Health Outcomes (Excerpts)