A Chinese-Burmese bilingual parallel corpus is a basic resource for research on Chinese-Burmese machine translation, cross-language retrieval, parallel sentence extraction, and bilingual entity extraction. A cross-language topic model represents documents in different languages in a shared topic space, so the correlation between documents in different languages can be computed at the semantic level. This gives strong support for acquiring Chinese-Burmese comparable documents, and building a Chinese-Burmese bilingual topic model is therefore important for that task. This paper takes corpus construction as its starting point and obtains a comparable corpus through the topic model. The main contributions are as follows:

(1) Construction of a Chinese-Burmese parallel corpus. Chinese-Burmese bilingual text resources are scarce, and there is no publicly available authoritative Chinese-Burmese corpus. Building a bilingual topic model requires a certain amount of Chinese-Burmese parallel documents as the training set, and the quality of these parallel documents affects the resulting topic model. This paper describes in detail how Chinese-Burmese bilingual texts were collected from three sources: web pages, electronic magazines, and the WeChat platform. For web pages, it details the automatic acquisition process based on web crawler technology; for electronic magazines and the WeChat platform, it describes the manual acquisition process. Finally, the resources are integrated into a Chinese-Burmese bilingual parallel corpus, and the corresponding data storage methods are described.

(2) A Chinese-Burmese bilingual topic model based on context features. The model builds on the bilingual LDA topic model and integrates the context features of the text. Bilingual LDA exploits the relatedness of parallel texts: the two sides of a parallel document pair share the same document-topic distribution. Fusing context features addresses the fact that the base model does not consider text structure. In essence, the fusion reduces the negative impact of raw word frequency on the topic distribution of a text. Experimental results show that the proposed Chinese-Burmese bilingual topic model with fused context features yields better document-topic distributions.

(3) A Chinese-Burmese bilingual topic model based on semantic expansion. Building on the model of the previous chapter, this model further integrates a Chinese-Burmese semantic dictionary. By analyzing and processing the dictionary, a Chinese-to-Burmese semantic expansion set is constructed. Words are weighted according to their context features and a threshold is set; each word whose weight exceeds the threshold is expanded with the corresponding Burmese words from the expansion set. This semantic expansion addresses the problem that a word can have multiple Burmese translations. The context features and the semantic expansion features are then fused into the bilingual LDA model. Comparative experiments show that the resulting bilingual topic model based on multi-feature fusion performs better than the models it is compared with.
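
The thresholded expansion step described in (3) can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `expand_document`, the weight table, the expansion set, and the threshold value are all hypothetical, and the Burmese entries are ASCII placeholders rather than real dictionary data. The sketch only assumes that each Chinese word already has a context-based weight and that the expansion set maps a Chinese word to its candidate Burmese translations.

```python
from typing import Dict, List


def expand_document(
    zh_tokens: List[str],
    context_weight: Dict[str, float],
    expansion_set: Dict[str, List[str]],
    threshold: float,
) -> List[str]:
    """Collect Burmese expansion terms for Chinese tokens above a weight threshold.

    All inputs are illustrative assumptions: `context_weight` stands in for the
    context-feature weighting, and `expansion_set` for the dictionary-derived
    Chinese-to-Burmese expansion set.
    """
    expanded: List[str] = []
    seen = set()
    for token in zh_tokens:
        # Skip words whose context-based weight does not exceed the threshold.
        if context_weight.get(token, 0.0) <= threshold:
            continue
        # Add every candidate translation once, so a word with several
        # Burmese renderings contributes all of them.
        for burmese_word in expansion_set.get(token, []):
            if burmese_word not in seen:
                seen.add(burmese_word)
                expanded.append(burmese_word)
    return expanded


# Toy data; the Burmese side uses placeholder strings, not real translations.
tokens = ["经济", "的", "合作"]
weights = {"经济": 0.8, "的": 0.05, "合作": 0.6}
expansion = {
    "经济": ["my_econ_a", "my_econ_b"],
    "合作": ["my_coop_a"],
    "的": ["my_de"],
}
burmese_terms = expand_document(tokens, weights, expansion, threshold=0.5)
```

In this sketch the low-weight function word is filtered out before expansion, while a content word with two candidate translations contributes both, which is how expansion can cover multiple translations of the same word. How the expanded terms enter the bilingual LDA model is not specified here.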