
English-Chinese Bilingual Phrasal Alignment

Posted on: 2008-01-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Qu
Full Text: PDF
GTID: 1115360242476067
Subject: Computer software and theory
Abstract/Summary:
... statistical strategy, because it can handle uncertainty and avoids the need for costly hand-tagged training data. Research on bilingual corpus processing is important, especially for domain-specific machine translation with a high recurrence rate.

English-Chinese phrasal alignment is an important part of bilingual corpus processing: it reveals the correspondence between two parallel sentences at the phrase level. A phrase-aligned bilingual corpus can be used to acquire English-Chinese translation knowledge. The input of the system is a domain-specific, sentence-aligned, raw bilingual corpus; the output is pairs of syntax trees aligned at the phrase level.

The conventional phrasal alignment method parses the source sentence and the target sentence separately, then aligns the resulting pair of syntax trees. Its drawback is that parsing accuracy strongly affects alignment accuracy.

Following the idea that "two languages are more informative than one", this paper proposes a method that performs alignment and ambiguity resolution simultaneously: the source language serves as extra information for resolving ambiguities in the target language, and the target language serves as extra information for resolving ambiguities in the source language.

The theoretical foundation of the implementation is the alignment model, which describes the constraints and correspondences between the source syntax tree and the target syntax tree. The key problem is to reveal the complex correspondence while avoiding interference from translation divergence. To address the divergence problem, this paper presents a principle of constancy in translation and an abstract-tree-based alignment model.

For POS tagging, this paper presents a double-state-based hidden Markov model. Since only a very small portion of words (fewer than 1,500) have more than one POS, we specialize these words by attaching extra transition probabilities to them. Each such word then carries its own specific information, which improves performance.

The system consists of three modules: a resource management module (including English/Chinese corpus management and English/Chinese dictionary management), which provides access to the corpus and the dictionaries; a preprocessing module (including English/Chinese POS tagging and English/Chinese parsing), which produces syntax tree pairs; and an alignment module (including sentence alignment, word alignment, and phrasal alignment).
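As an illustration of the POS-tagging idea in the abstract, the following is a minimal sketch of a Viterbi decoder in which ambiguous words may carry their own transition probabilities. It is one possible reading of "attaching extra transition probabilities" to ambiguous words, not the dissertation's double-state model, whose details the abstract does not give. The tag set, all probability values, and the names `TRANS`, `EMIT`, `WORD_TRANS`, `transition`, and `viterbi` are assumptions invented for this example.

```python
import math

# Toy parameters invented for illustration; none come from the dissertation.

# Generic transition probabilities P(tag | previous tag); "<s>" is sentence start.
TRANS = {
    "<s>":  {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}

# Emission probabilities P(word | tag), stored per word for convenience.
EMIT = {
    "the":   {"DET": 1.0},
    "dog":   {"NOUN": 0.5},
    "plan":  {"NOUN": 0.3, "VERB": 0.4},
    "works": {"NOUN": 0.2, "VERB": 0.6},
}

# Assumed word-specific transitions P(tag | previous tag, word),
# defined only for selected ambiguous words.
WORD_TRANS = {
    "plan": {"DET": {"NOUN": 0.98, "VERB": 0.02}},
}


def transition(prev_tag, tag, word):
    """Return the word-specific transition if one exists, else the generic one."""
    specific = WORD_TRANS.get(word, {}).get(prev_tag)
    if specific is not None:
        return specific.get(tag, 0.0)
    return TRANS[prev_tag].get(tag, 0.0)


def viterbi(words):
    """Return the most probable tag sequence for words found in EMIT."""
    # best maps a tag to (log-probability of the best path ending there, path).
    best = {"<s>": (0.0, [])}
    for word in words:
        nxt = {}
        for tag, p_emit in EMIT[word].items():
            candidates = []
            for prev_tag, (score, path) in best.items():
                p = transition(prev_tag, tag, word) * p_emit
                if p > 0.0:
                    candidates.append((score + math.log(p), path + [tag]))
            if candidates:
                nxt[tag] = max(candidates, key=lambda c: c[0])
        best = nxt
    return max(best.values(), key=lambda c: c[0])[1]


print(viterbi(["the", "plan", "works"]))  # ['DET', 'NOUN', 'VERB']
```

In this sketch, only words listed in `WORD_TRANS` deviate from the generic model, which mirrors the observation that few words are ambiguous. Unknown words are not handled and would raise a `KeyError`.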
Keywords/Search Tags: Natural Language Processing, Phrase alignment, bilingual corpus