| With the wide application and rapid development of statistical methods in natural language processing, corpora, as a basic resource, are becoming increasingly important. Bilingual corpora in particular play an important role in areas such as machine translation and cross-language information retrieval. However, parallel corpora, the typical bilingual corpus resource, are limited and difficult to obtain. The situation is even more serious for low-resource languages such as the minority languages of China. Comparable corpora have therefore gradually attracted the attention of researchers in recent years. Comparable corpora can be used directly for mining translation equivalents, including named entities, new words and terminology, and can lay the foundation for building parallel corpora. Early methods for building comparable corpora relied primarily on features of the text content and had low accuracy and efficiency; current methods are generally based on cross-language information retrieval, although some scholars use Wikipedia resources to mine comparable corpora. Given the limited resources available for minority languages and the immaturity of machine translation systems for them, this paper builds a Tibetan-Chinese bilingual comparable corpus based on the idea of cross-language information retrieval, using bilingual dictionaries and publicly available Internet search engines. We collected a Tibetan news corpus from major Tibetan news websites such as 'www.xzxw.com' as the source documents, and then carried out the following work. First, keyword extraction. Building on the traditional TF-IDF algorithm, we integrate additional features of each word, including the position of its first appearance, its length and its part of speech, to improve the quality of keyword extraction. Second, keyword translation. To obtain Chinese keywords from those of the Tibetan source documents, a Chinese-Tibetan dictionary is needed. 
Because a Tibetan word in the dictionary usually corresponds to several Chinese meanings, this paper performs disambiguation with a global co-occurrence-based approach, which reduces the number of query term combinations and improves the efficiency of comparable corpus construction. Third, introduction of named entities, which act as query terms together with the extracted keywords. Person names, place names, time words and quantity words are extracted from the title and first paragraph of each document. These named entities usually indicate elements of a news event, such as its time, place and participants. Introducing named entities increases the proportion of comparable documents in the results returned by the search. Fourth, filtering the candidate comparable documents by computing bilingual document similarity. The Chinese query terms obtained in the previous steps are submitted to an Internet search engine, and the top-ranked search results are taken as candidate comparable documents. The Dice coefficient is chosen as the decision factor, and its threshold is determined through experiments. This paper collected 1120 Tibetan source articles and obtained 4576 comparable Chinese articles; 78 percent of the Tibetan source articles obtained corresponding comparable documents. Experiments show that the comparable corpora built by the proposed method cover a wide range of domains and are timely and extensible, and that the method is suitable for building large-scale comparable corpora. The method can also be applied to building comparable corpora for other minority languages. |
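
The fourth step, filtering candidates with the Dice coefficient, can be illustrated with a minimal Python sketch. This is not the paper's implementation. It assumes that each document has already been reduced to a set of terms, that the source terms are mapped into Chinese through a dictionary, and that the Dice coefficient is computed over term sets as 2|A ∩ B| / (|A| + |B|). The function names, the placeholder source tokens (`s1`, `s2`), the toy dictionary and the threshold of 0.5 are all hypothetical; the abstract states only that the threshold was determined experimentally.

```python
from typing import Dict, Iterable, List, Set, Tuple


def dice_coefficient(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|); 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def translate_terms(terms: Iterable[str],
                    dictionary: Dict[str, List[str]]) -> Set[str]:
    """Map source-language terms to every translation listed in the dictionary.

    Assumption: no disambiguation is applied here; terms missing from the
    dictionary are silently dropped.
    """
    translated: Set[str] = set()
    for term in terms:
        translated.update(dictionary.get(term, []))
    return translated


def filter_candidates(source_terms: Iterable[str],
                      candidates: Dict[str, Iterable[str]],
                      dictionary: Dict[str, List[str]],
                      threshold: float) -> List[Tuple[str, float]]:
    """Keep candidates whose Dice score against the translated source terms
    reaches the threshold, sorted by descending score."""
    translated = translate_terms(source_terms, dictionary)
    kept = []
    for doc_id, cand_terms in candidates.items():
        score = dice_coefficient(translated, set(cand_terms))
        if score >= threshold:
            kept.append((doc_id, score))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data: "s1" and "s2" stand in for Tibetan keywords.
    toy_dictionary = {"s1": ["会议"], "s2": ["北京", "首都"]}
    toy_candidates = {
        "doc1": ["会议", "北京", "经济"],
        "doc2": ["体育"],
    }
    # The threshold 0.5 is illustrative only.
    for doc_id, score in filter_candidates(["s1", "s2"], toy_candidates,
                                           toy_dictionary, 0.5):
        print(doc_id, round(score, 3))
```

In this toy run only `doc1` passes the threshold, since it shares two of the three translated terms with the source document. A real pipeline would also need word segmentation of the Chinese candidates and the disambiguation described in the second step, both of which are outside this sketch.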