A Chinese-Mongolian parallel corpus (CM parallel corpus for short) is a corpus that contains Chinese source texts and their Mongolian translations. It can support many kinds of Chinese-Mongolian bilingual processing systems, for example a Chinese-Mongolian machine translation system.
Alignment is the first important processing step in making a parallel corpus usable. It consists of finding corresponding segments in a parallel corpus and establishing links between them. Depending on the language unit that the segments represent, the task may be section alignment, paragraph alignment, sentence alignment, phrase alignment or word alignment. In most cases, a parallel corpus has to be word-aligned before it is ready for real applications such as statistical machine translation (SMT), example-based machine translation (EBMT), word sense disambiguation (WSD) and bilingual dictionary compilation.
Most of the data in the CM parallel corpus have so far been collected manually, and the corpus has already been aligned at the sentence level by the operators. We therefore start with the task of automatically aligning words within aligned sentences.
Language-independent word aligners, such as the well-known GIZA++, are available nowadays. Because they are built on purely statistical methods, they usually perform well when the training data are extremely large. Since the CM parallel corpus is a small-scale one, we need an alternative solution for Chinese-Mongolian word alignment.
In view of the resources available to us, this thesis presents a knowledge-intensive approach to Chinese-Mongolian word alignment, along with a review of related research.
The main idea of the approach is to first build a framework around a dictionary-based greedy algorithm, and then to improve its performance step by step by integrating external knowledge and information: Mongolian synonyms, Mongolian inflectional morphology, Mongolian consecutive multi-word units, rules for converting Chinese and Mongolian numeral words into Arabic numerals, and correspondence regularities between Chinese prepositions and Mongolian cases. Concretely, we:
(1) combine a Mongolian thesaurus with a Chinese-Mongolian bilingual dictionary in order to provide more translational information for the alignment process;
(2) recognize Mongolian stems by morphologically analyzing the words in the texts, so that the system can compute similarities between Chinese words and Mongolian base forms instead of Mongolian inflected forms, which we believe is more reasonable for a dictionary-based alignment approach;
(3) identify consecutive multi-word units in the Mongolian text, "tie" them together and treat each as a single word during alignment, which helps to obtain some of the "1:n" links;
(4) convert Chinese and Mongolian numeral words into Arabic numerals and align them;
(5) find the translational relations between Chinese prepositions and Mongolian cases based on their correspondence regularities.
In addition, since there are always disagreements and inconsistencies among different annotators, the thesis presents guidelines and rules for Chinese-Mongolian word alignment annotation, so that a consistent gold-standard reference can be built manually.
For the experiments, recall, precision and F-value are used as evaluation metrics. The alignment approach presented in the thesis is tested on three different types of test sets, namely a daily-life sentence set, a governmental document set and a novel set.
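To make the dictionary-based greedy framework and step (2) more concrete, the following is a minimal sketch, not the implementation used in the thesis. It assumes that a link is proposed whenever the base form of a Mongolian token appears among the dictionary translations of a Chinese token, and that competing candidates are resolved greedily by relative sentence position. The function name `greedy_align`, the tie-breaking rule, and the toy romanized lexicon and lemma table are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Set, Tuple

Link = Tuple[int, int]  # (Chinese token index, Mongolian token index)


def greedy_align(
    zh: List[str],
    mn: List[str],
    lexicon: Dict[str, Set[str]],
    stem: Callable[[str], str] = lambda word: word,
) -> List[Link]:
    """Greedy one-to-one alignment driven by a bilingual dictionary.

    Hypothetical sketch: a candidate link exists when the base form of a
    Mongolian token is listed as a translation of the Chinese token.
    """
    candidates = []
    for i, zh_word in enumerate(zh):
        translations = lexicon.get(zh_word, set())
        for j, mn_word in enumerate(mn):
            if stem(mn_word) in translations:
                # Assumed tie-breaker: prefer pairs at similar relative positions.
                distance = abs(i / len(zh) - j / len(mn))
                candidates.append((distance, i, j))

    # Greedy step: take the best remaining candidate whose two tokens are free.
    candidates.sort()
    used_zh: Set[int] = set()
    used_mn: Set[int] = set()
    links: List[Link] = []
    for _, i, j in candidates:
        if i not in used_zh and j not in used_mn:
            links.append((i, j))
            used_zh.add(i)
            used_mn.add(j)
    return sorted(links)


# Toy data for illustration only; not taken from the thesis's resources.
TOY_LEXICON = {"我": {"bi"}, "读": {"unshikh"}, "书": {"nom"}}
TOY_LEMMAS = {"unshina": "unshikh"}  # inflected form -> dictionary form


def toy_stem(word: str) -> str:
    """Stand-in for a real morphological analyzer (assumption)."""
    return TOY_LEMMAS.get(word, word)


if __name__ == "__main__":
    zh_tokens = ["我", "读", "书"]
    mn_tokens = ["bi", "nom", "unshina"]
    print(greedy_align(zh_tokens, mn_tokens, TOY_LEXICON))
    print(greedy_align(zh_tokens, mn_tokens, TOY_LEXICON, stem=toy_stem))
```

In this toy case the verb is linked only when the inflected form is first reduced to its dictionary form, which is the motivation given for step (2). A thesaurus, as in step (1), could be modeled the same way by enlarging the translation sets in the lexicon.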
The results show that the external knowledge and information improve the performance of the dictionary-based framework in many ways. The system achieves its best result on the daily-life sentence set, with 59.2% recall and 81.4% precision.
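The evaluation metrics can be illustrated with a short sketch. It assumes that alignments are compared as sets of (Chinese index, Mongolian index) links against a gold standard, and that "F-value" means the balanced harmonic mean of precision and recall; the thesis may define or weight these differently. The function names are illustrative.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (Chinese token index, Mongolian token index)


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-value: harmonic mean of precision and recall (assumed)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-value) of predicted links against gold links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    # Toy link sets for illustration only.
    gold_links = {(0, 0), (1, 2), (2, 1)}
    predicted_links = {(0, 0), (2, 1), (1, 1)}
    print(evaluate(predicted_links, gold_links))
    # Balanced F-value for the precision and recall quoted in the abstract.
    print(round(f_measure(0.814, 0.592), 3))
```

Under the balanced-F assumption, the precision and recall quoted above for the daily-life sentence set would correspond to an F-value of roughly 68.5%.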