
Research On Multilingual And Cross-domain Neural Machine Translation Technology Based On Transformer

Posted on: 2024-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Chen
Full Text: PDF
GTID: 2568307073468264
Subject: Computer Science and Technology
Abstract/Summary:
Neural machine translation is a technique that uses artificial neural networks to translate one language (the source language) into another (the target language). Among the various machine translation methods, neural machine translation has achieved high-quality translation in general domains with large amounts of parallel corpus data, thanks to neural network technology. However, problems remain in multilingual and domain-adaptive neural machine translation: (1) the scarcity of corpora for minority languages and specialized domains means that translation models cannot effectively learn word vector representations under low-resource conditions, leading to mistranslation and omission; (2) in multilingual neural machine translation, it is unclear how to transfer knowledge from high-resource languages to enhance semantic learning for low-resource languages; (3) in domain-adaptive neural machine translation, domain knowledge overfits, a single model can serve only one domain, and training requires large-scale manual parameter tuning.

In response to the above problems, this study focuses on the following:

(1) To address the scarcity of corpora for minority languages and specialized fields, a Scrapy crawler system was used to collect more than 1 million patent texts. Through data cleaning, chapter segmentation, domain filtering, and machine translation, more than 100,000 parallel sentence pairs were constructed in six language pairs, such as English-Japanese and English-Spanish, in the information technology domain. The constructed parallel corpora were evaluated with indicators based on sentence length, the translation quality of content words, and the translation quality of phrases. The top 25% and bottom 25% of the evaluated corpus were then used for translation model training. The results show that the BLEU scores of models trained on the top 25% of the corpus are all higher than those of models trained on the bottom 25%, with the English-French model achieving the highest BLEU value, 1.18.

(2) To address the problem of transferring knowledge from high-resource languages to enhance semantic learning for low-resource languages in multilingual translation, a neural machine translation method based on semantic space sharing and self-back-translation is proposed. The method uses semantic space sharing to map the lexical representations of multiple languages into a common semantic space with shared word representations. A self-back-translation strategy is then integrated into the semantic space sharing model: at each training step, the sentences predicted in forward translation are back-translated to fit the source sentences, acquiring more contextual knowledge from a limited corpus. Experiments were conducted on four low-resource datasets translating Romanian (Ro), Azerbaijani (Aze), Belarusian (Bel), and Galician (Glg) into English. The results show BLEU improvements of 4.3 for Romanian (Ro) and 5.1 for Galician (Glg) over the baseline model, indicating that the proposed method significantly improves translation quality in multilingual low-resource settings.

(3) To address knowledge overfitting, poor model flexibility, and the dominance of human experience in domain-adaptive neural machine translation, this paper proposes a multi-domain adaptive approach (KAIP) based on knowledge augmentation and incremental pruning. The method uses a knowledge-hiding strategy in which an auxiliary corpus drives auxiliary task learning during training, augmenting the knowledge passed forward from the encoder to the decoder; it then uses a model pruning strategy to learn multiple disjoint domain-specific sub-networks, adapting to multiple different domains without adjusting the model. Single- and multi-domain adaptation tasks on four target-domain datasets and five extended-domain datasets show significant BLEU improvements in every domain: 2.3 on the Novel domain, 1.1 on the EMEA domain, and 1.4 on the IT domain. This verifies that the proposed method can effectively handle domain-adaptive tasks.
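The corpus-ranking step in contribution (1) can be illustrated with a minimal sketch. The `length_ratio_score` heuristic below is an illustrative stand-in for the thesis's actual indicators (sentence length, content-word quality, phrase quality), and the sample pairs are invented; only the top/bottom 25% split mirrors the described procedure.

```python
def length_ratio_score(src: str, tgt: str) -> float:
    """Penalize pairs whose token counts diverge sharply (likely misalignment)."""
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0:
        return 0.0
    return min(ls, lt) / max(ls, lt)

def split_by_quality(pairs, score=length_ratio_score, frac=0.25):
    """Return (top, bottom): the best and worst `frac` of the corpus by score."""
    ranked = sorted(pairs, key=lambda p: score(*p), reverse=True)
    k = max(1, int(len(ranked) * frac))
    return ranked[:k], ranked[-k:]

# Toy English-French pairs (hypothetical data, not from the thesis corpus).
pairs = [
    ("the patent covers a method", "le brevet couvre une méthode"),
    ("hello", "bonjour tout le monde et merci beaucoup"),
    ("data cleaning step", "étape de nettoyage des données"),
    ("system", "le système informatique décrit dans la revendication"),
]
top, bottom = split_by_quality(pairs)
```

Training one model on `top` and another on `bottom`, then comparing BLEU, is the contrastive evaluation the abstract describes.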
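The two ideas in contribution (2) can be sketched conceptually: all languages draw token vectors from one shared embedding table (semantic space sharing), and a reconstruction loss compares the back-translated prediction with the source (self-back-translation). The vocabulary, dimensions, and the toy round-trip "translators" below are assumptions for illustration, not the thesis's model.

```python
import numpy as np

rng = np.random.default_rng(0)
# One shared vocabulary and embedding table for every language; target-language
# tags such as <2en> are a common multilingual-NMT convention assumed here.
shared_vocab = {"<2en>": 0, "<2ro>": 1, "hello": 2, "salut": 3, "lume": 4, "world": 5}
E = rng.normal(size=(len(shared_vocab), 8))

def embed(tokens):
    """Look up rows of the single shared table, regardless of language."""
    return E[[shared_vocab[t] for t in tokens]]

def self_back_translation_loss(src_tokens, forward, backward):
    """Translate forward, translate the prediction back, and penalize the
    distance between the reconstruction and the original source embedding."""
    src_vec = embed(src_tokens).mean(axis=0)
    pred = forward(src_tokens)      # source -> predicted target sentence
    recon = backward(pred)          # predicted target -> source again
    recon_vec = embed(recon).mean(axis=0)
    return float(np.sum((src_vec - recon_vec) ** 2))

# A perfect round trip reconstructs the source exactly, so the loss is zero.
toy_fwd = lambda toks: ["salut", "lume"]
toy_bwd = lambda toks: ["hello", "world"]
loss = self_back_translation_loss(["hello", "world"], toy_fwd, toy_bwd)
```

In the actual method this loss term would be added to the forward-translation objective at every training step, so the model extracts extra context from the same limited corpus.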
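The pruning side of contribution (3) can be sketched with magnitude pruning over a shared parameter tensor: each domain keeps its own binary mask, and switching domains means switching masks with no parameter changes. This is an illustration of the disjoint sub-network idea under assumed shapes and keep fractions, not the thesis's KAIP implementation.

```python
import numpy as np

def domain_mask(weights: np.ndarray, keep_frac: float) -> np.ndarray:
    """Binary mask keeping the top `keep_frac` of weights by magnitude."""
    k = int(weights.size * keep_frac)
    threshold = np.sort(np.abs(weights), axis=None)[-k]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(42)
W = rng.normal(size=(4, 4))       # shared parameters, trained once

# One sub-network per domain; fractions here are arbitrary examples.
masks = {"novel": domain_mask(W, 0.5), "it": domain_mask(W, 0.25)}

def forward(x, domain):
    """Apply only the chosen domain's sub-network; other weights stay intact."""
    return x @ (W * masks[domain])
```

Because each mask only zeroes weights at inference time, adding a new domain later amounts to learning one more mask, without retuning the shared model.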
Keywords/Search Tags:Neural machine translation, Semantic space sharing, Corpus evaluation, Domain adaptive translation