Research On The Transliteration Model Of Tibetan And Chinese Names Based On Hybrid Strategies

Posted on: 2020-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: C Shao
GTID: 2435330602456447
Subject: Computer Science and Technology
Abstract/Summary:
Transliteration is the phonetic translation of names across languages. Transliteration of named entities (NEs) is a necessary subtask in many applications, such as machine translation, corpus alignment, cross-language information retrieval, information extraction, and automatic lexicon acquisition, and personal names are an important class of NEs. Building on a study of character-based transliteration models for Tibetan and Chinese names, this thesis summarizes the problems that scarce resources cause in transliteration, proposes a transliteration strategy of "determining the pronunciation first, then determining the characters", and reorders the candidate transliteration pairs using a linear combination strategy and a pronunciation-based reranking strategy. The research focuses on the following:

1. The thesis first introduces the character-based ("shape-based") transliteration framework for Tibetan and Chinese names, and analyzes and compares two transliteration models: the joint source channel model and the conditional random field model. For the joint source channel model, it describes how the Tibetan-Chinese name transliteration model performs segmentation and alignment on a character basis, introduces the beam search algorithm used to generate multiple candidate transliteration pairs, and then analyzes the intermediate parameters output by the model. From the bigram transliteration pairs, the unigram transliteration pairs, and the correspondence between source-language and target-language syllables, it finds that the correspondence matrix is sparse because of the polyphonic phenomenon in Chinese characters; this finding prepares the ground for the later work. For the conditional random field model, it briefly introduces the feature templates of the model and the principles behind their design.

2. Tibetan is a low-resource language, and large Tibetan-Chinese bilingual corpora are difficult to obtain. The thesis therefore puts forward the strategy of "determining the pronunciation first, then determining the characters". It rewrites Tibetan names in Tibetan Latin transliteration and Chinese names in Hanyu Pinyin, and divides Tibetan-to-Chinese name transliteration into three stages: from Tibetan to Latin, from Latin to Pinyin, and from Pinyin to Chinese characters. Bilingual pronunciation data is used to determine the Chinese pronunciation of a Tibetan name, and a large Chinese monolingual corpus is then used to determine the characters. At the Latin-to-Pinyin stage, the intermediate parameters are output and compared with the character-based results.

3. The comparison shows that the strategy of "determining the pronunciation first, then determining the characters" effectively reduces the sparseness of the correspondence matrix and improves the final transliteration results. At the Pinyin-to-character stage, a large number of Chinese monolingual names are used to train the Pinyin-to-character model, so the demand for a Tibetan-Chinese bilingual corpus, which is difficult to obtain, is transformed into a demand for a Chinese monolingual corpus, which is easier to obtain. This improves the accuracy of the model to some extent and yields better transliteration results.

4. The thesis also examines how to reorder the generated candidate transliteration pairs. After analyzing the characteristics of the candidate pairs obtained earlier and the principle of reordering, it uses the linear combination strategy and the pronunciation-based reranking strategy to design four experiments that compare different models under the same method and different methods under the same model. The linear combination strategy and the pronunciation rules produced by the phoneme-based experiments are used as the criteria for reordering the character-based experimental results, and the experimental results are analyzed.
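The staged strategy and the reranking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the probability tables are invented toy values, the Latin syllables assume a Wylie-style romanization, the per-syllable lookup stands in for the trained joint source channel or CRF models, and `rerank` assumes that "linear combination" means a weighted sum of two log scores. The names `beam_search`, `transliterate`, `rerank`, `lam`, and `floor` are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Toy probability tables. Every value here is invented for illustration;
# none of them comes from the thesis or from a trained model.
# Stage 2: Latin (Wylie-style, assumed) syllable -> Pinyin syllable.
LATIN_TO_PINYIN: Dict[str, Dict[str, float]] = {
    "bkra": {"zha": 0.7, "za": 0.3},
    "shis": {"xi": 0.8, "shi": 0.2},
}
# Stage 3: Pinyin syllable -> Chinese character (monolingual in a real system).
PINYIN_TO_HANZI: Dict[str, Dict[str, float]] = {
    "zha": {"扎": 0.6, "札": 0.4},
    "za": {"杂": 1.0},
    "xi": {"西": 0.7, "喜": 0.3},
    "shi": {"什": 1.0},
}

Hypothesis = Tuple[float, List[str]]  # (log-probability, output units so far)


def beam_search(units: List[str], table: Dict[str, Dict[str, float]],
                beam_width: int) -> List[Hypothesis]:
    """Expand one input unit at a time, keeping the best `beam_width` paths."""
    beam: List[Hypothesis] = [(0.0, [])]
    for unit in units:
        expanded = [
            (score + math.log(prob), output + [target])
            for score, output in beam
            for target, prob in table[unit].items()
        ]
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_width]
    return beam


def transliterate(latin_syllables: List[str],
                  beam_width: int = 3) -> List[Tuple[float, str]]:
    """Pronunciation first (Latin -> Pinyin), then characters (Pinyin -> Hanzi)."""
    results: List[Tuple[float, str]] = []
    for py_score, pinyin in beam_search(latin_syllables, LATIN_TO_PINYIN, beam_width):
        for hz_score, hanzi in beam_search(pinyin, PINYIN_TO_HANZI, beam_width):
            results.append((py_score + hz_score, "".join(hanzi)))
    results.sort(key=lambda item: item[0], reverse=True)
    return results


def rerank(char_scores: Dict[str, float], phon_scores: Dict[str, float],
           lam: float = 0.5, floor: float = -20.0) -> List[Tuple[str, float]]:
    """Weighted sum of two log scores; `floor` is an assumed score for a
    candidate that one of the two systems did not produce."""
    candidates = set(char_scores) | set(phon_scores)
    combined = {
        cand: lam * char_scores.get(cand, floor)
        + (1.0 - lam) * phon_scores.get(cand, floor)
        for cand in candidates
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for log_prob, name in transliterate(["bkra", "shis"]):
        print(f"{name}\t{math.exp(log_prob):.4f}")
```

Because the Pinyin-to-character table needs only Chinese names, it could in principle be estimated from monolingual data, which is the point of splitting the task into stages. In a fuller sketch, identical character strings reached through different Pinyin paths would have their probabilities merged before reranking.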
Keywords/Search Tags:Tibetan-Chinese Transliteration, Joint Source Channel Model, Segmentation Granularity, Reranking