Research On Key Technologies Of Chinese-Tibetan Neural Machine Translation

Posted on: 2022-09-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D C R Tou
GTID: 1485306509997969
Subject: Tibetan Computational Linguistics
Abstract/Summary:
Machine translation is a branch of computational linguistics at the intersection of computer technology, mathematics, cognitive science, linguistics, information theory and other disciplines. It transforms one natural language into another and is one of the ultimate goals of artificial intelligence. Research on Chinese-Tibetan machine translation is of great practical significance for inheriting and carrying forward excellent ethnic culture, enhancing cultural exchange, fostering the communication of ideas, serving the national One Belt and One Road initiative, and promoting social, economic, educational and cultural development in the Tibetan areas of China. It can also drive substantive progress in Tibetan computational linguistics, and it has significant value for both scientific research and practical application. Building on a high-quality Chinese-Tibetan bilingual parallel corpus, this thesis makes a preliminary exploration of key technologies, namely Tibetan long-sentence segmentation, Tibetan place name recognition and an improved Tibetan byte pair encoding, in an attempt to improve translation performance by optimizing the Chinese-Tibetan neural machine translation model. Specifically, the research content mainly includes the following aspects:
(1) Corpus preprocessing: This thesis focuses on methods for segmenting long Tibetan sentences. It summarizes the rules of Tibetan sentence boundary recognition and identifies the difficulties that stand in the way of improving it. It proposes a segmentation method that integrates a deep-learning sequence labeling framework, Bi-LSTM (Bidirectional Long Short-Term Memory) + CRF (Conditional Random Fields), with the Tibetan dependency syntax structure to divide long Tibetan sentences. Experiments show that the method segments long Tibetan sentences effectively, with an F value of 99.42%.
(2) Tibetan named entity recognition: This thesis mainly introduces Tibetan place name recognition technology and shows that the syllables, trigger words, follow-up words and case auxiliary words of Tibetan place names are suitable features for CRF-based place name recognition. The experimental results show that the precision, recall and F value of the proposed method are 96.12%, 81.92% and 88.45%, respectively. To handle Tibetan place names in translation, the thesis integrates place name recognition into the segmentation of the training corpus; the BLEU value reaches 30.46, which improves the translation of Chinese and Tibetan named entities.
(3) Model improvement: By improving the byte pair encoding algorithm, this thesis proposes a Tibetan byte pair encoding algorithm with a word number threshold to optimize the attention-based Chinese-Tibetan neural machine translation model. One million Chinese-Tibetan sentence pairs and 200,000 dictionary entries of Chinese-Tibetan names and place names were collected and sorted, and the Chinese-Tibetan neural machine translation model was trained on these data. In testing and validation, the BLEU value of the model reached 36.84. The model constructed in this thesis outperforms the commercial Chinese-Tibetan online translation system in translating named entities.
(4) System improvement: A Chinese-Tibetan neural machine translation system based on the attention mechanism and the improved byte pair encoding is designed and implemented, with optimized back-end processes and programs. The system is deployed on the Sunlight Chinese-Tibetan machine translation website, where it serves to promote the Sunlight Chinese-Tibetan neural machine translation system V2.
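The place name recognition scores in item (2) can be cross-checked with a few lines of arithmetic. The sketch below assumes that the first reported figure (96.12%) is precision, that 81.92% is recall, and that the F value is the balanced F1 score (their harmonic mean); the abstract does not state these definitions, and the function name `f_measure` is only illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta = 1 gives the harmonic mean (F1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted in the abstract, in percent (assumed to be precision and recall).
precision = 96.12
recall = 81.92

f1 = f_measure(precision, recall)
print(f"F1 = {f1:.2f}%")  # prints "F1 = 88.45%"
```

Under these assumptions the result agrees with the reported F value of 88.45%, which suggests the three figures are internally consistent.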
Keywords/Search Tags: Chinese-Tibetan neural machine translation, attention mechanism, corpus, place name recognition, byte pair encoding