
A Research On Key Methods In Tibetan-Chinese (Chinese-Tibetan) Machine Translation Under Low-Resource Condition

Posted on: 2021-04-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z J C Ci
GTID: 1365330611994955
Subject: Computer Science and Technology
Abstract:

With the breakthroughs of deep learning in natural language processing, machine translation technology has undergone revolutionary changes and is now widely used in daily life and work. China is a unified multi-ethnic country, and its government attaches great importance to natural language processing for minority languages. In recent years, these technologies have developed vigorously in China, and Tibetan natural language processing in particular has made continued progress. However, both statistical and neural methods require large-scale bilingual or multilingual language resources as support, and Tibetan-related machine translation research still suffers from scarce data resources, weak theoretical foundations, and immature technical methods. This thesis studies theories and methods for constructing Chinese-Tibetan language resources, integrating monolingual language models, introducing back-translation strategies, and modeling cross-lingual language models. It aims to provide an effective research approach for Tibetan-Chinese machine translation under low-resource conditions, to alleviate the current resource scarcity and unsatisfactory translation performance, and to provide technical support for the harmonious development of Tibetan society and economy. The main work and innovations of this thesis are summarized as follows.

To address the lack of Tibetan-Chinese language data, a language resource construction technique for Tibetan-Chinese machine translation is proposed, and Tibetan and Chinese monolingual corpora and a Tibetan-Chinese bilingual corpus based on the People's Daily (Tibetan version) are constructed. The thesis first uses large-scale web-based resource acquisition to collect Tibetan monolingual news corpora. It then applies CNN+Bi-LSTM+CRF Tibetan word segmentation and Bi-LSTM+CRF named entity recognition to identify and extract times, places, persons, organizations, and other named entities in the Tibetan news. These entities are translated to obtain the corresponding Chinese named entities, which are used to match the corresponding Chinese news articles; the matched texts are then preprocessed. Text similarity and cross-lingual sentence similarity, computed with a Bi-LSTM + Attention model, are used to align the Tibetan and Chinese news texts at the document and sentence level. The result is a Tibetan monolingual corpus of 538,340 sentences, a Chinese monolingual corpus of 617,590 sentences, and a Tibetan-Chinese bilingual corpus of 537,620 sentence pairs.

To address the lack of bilingual parallel resources in Tibetan-Chinese machine translation, and taking advantage of the relative abundance of Tibetan monolingual data, a Tibetan-Chinese machine translation method that incorporates a monolingual language model is proposed. The author first trains a Tibetan monolingual language model using recurrent neural network language modeling and fuses it with the pre-output of the translation model's decoder. Using shallow fusion and deep fusion, the words generated by the translation model and by the language model are re-weighted so that a mapping is established between the source and target languages, and the target-language output is then produced. This method effectively improves Tibetan-Chinese machine translation under poor language resources. In experiments under the same resource conditions, it improves over the original Transformer baseline system by 3.4 (Tibetan-Chinese) and 4.7 (Chinese-Tibetan) BLEU points.

To address the poor performance of Tibetan-Chinese machine translation under low-resource conditions, an iterative back-translation strategy is proposed. The thesis first builds an initial Transformer system from the existing Tibetan-Chinese parallel data and uses it to translate large-scale monolingual corpora (forward), yielding sentence pairs whose source side is correct text and whose target side is generated by translation. A sentence-pair filtering mechanism then constructs a pseudo-parallel Tibetan-Chinese corpus with strong supervision signals, which is added to the training of the translation model. Back-translation (reverse) is then applied in the same way, yielding pairs whose target side is correct text and whose source side is generated by translation. After repeated iterations, performance improves over the original system by 6.7 (Tibetan-Chinese) and 9.8 (Chinese-Tibetan) BLEU points.

To address the limited scale and domain coverage of Tibetan-Chinese parallel data and its poor fit to supervised neural machine translation models, a Tibetan-Chinese machine translation method based on cross-lingual language modeling is proposed. The thesis first builds a Tibetan-Chinese machine translation system with Transformer as the baseline. Through pre-training of Tibetan and Chinese masked language models (MLM) and modeling of a Tibetan-Chinese cross-lingual translation model, a mapping is established between the resource-rich language (Chinese) and the resource-poor language (Tibetan). Tibetan-Chinese sentence pairs labeled with detailed language and position information are then fed in as a text stream, and this pre-training method is used to optimize the baseline model. Experiments show BLEU gains of 8.1 (Transformer Base+MLM compared with Transformer Base) and 5.7 (Transformer Base+MLM compared with SMT).

In this work the author studies the construction of Tibetan-Chinese language resources and related technologies in order to build a language resource library for Tibetan-Chinese machine translation. The constructed dataset is then used in actual machine translation model training, and experiments indicate a promising performance gain over several strong baselines. The author hopes this work will provide a reference and a method for the study of Tibetan-Chinese machine translation.
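
As an illustration of the shallow fusion mentioned in the abstract, the following is a minimal sketch of how a translation model's next-token scores can be combined with a monolingual language model's scores at a single decoding step. It is not the thesis's implementation: the function names, the `lm_weight` parameter, the floor value for unseen tokens, and the toy probabilities are all assumptions made for illustration.

```python
import math

# Hypothetical floor log-probability for tokens the language model has not scored.
UNSEEN_LOG_PROB = math.log(1e-9)


def shallow_fusion_step(tm_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-token log-probabilities for one decoding step.

    fused(token) = log p_TM(token) + lm_weight * log p_LM(token)

    tm_log_probs: dict mapping candidate token -> translation-model log-prob
    lm_log_probs: dict mapping token -> language-model log-prob
    lm_weight:    assumed interpolation weight (would be tuned on held-out data)
    """
    fused = {}
    for token, tm_lp in tm_log_probs.items():
        lm_lp = lm_log_probs.get(token, UNSEEN_LOG_PROB)
        fused[token] = tm_lp + lm_weight * lm_lp
    return fused


def pick_best(fused_scores):
    """Greedy choice: return the token with the highest fused score."""
    return max(fused_scores, key=fused_scores.get)


# Toy distributions (made-up numbers): the translation model slightly prefers
# "a", while the language model strongly prefers "b".
tm = {"a": math.log(0.5), "b": math.log(0.4), "c": math.log(0.1)}
lm = {"a": math.log(0.1), "b": math.log(0.8), "c": math.log(0.1)}

best_without_lm = pick_best(shallow_fusion_step(tm, lm, lm_weight=0.0))
best_with_lm = pick_best(shallow_fusion_step(tm, lm, lm_weight=0.5))
```

With `lm_weight=0.0` the choice follows the translation model alone; raising the weight lets the language model's preference shift the selected token. In a real decoder the same combination would be applied to every hypothesis inside beam search rather than to a single greedy step.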
Keywords: Tibetan-Chinese Machine Translation, Low Resource, Neural Network, Fusion, Iterative Back Translation, Cross-lingual Model