Reordering Of Source Language Sentences For Statistical Machine Translation

Posted on: 2016-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: J S Cai
Full Text: PDF
GTID: 2298330467972828
Subject: Computer Science and Technology
Abstract/Summary:
Machine Translation (MT) is the study of using computers to translate one natural language into another, and it is one of the main branches of Natural Language Processing (NLP). Statistical Machine Translation (SMT), one of the most popular approaches to MT, has a systematic theoretical basis and well-developed translation models, which allow an MT system to be built quickly and effectively. However, SMT systems usually perform poorly on distant language pairs, where the word orders of the source and target languages differ greatly. One of the main reasons is that current theories and models cannot capture and describe these word-order differences well.

To tackle this problem, this thesis proposes two syntax-based reordering approaches for SMT: a constituent-based one and a dependency-based one. The main idea is to pre-process the source sentences in both the training data and the test data using syntactic information. Reordering brings the word order of the source language closer to that of the target language and thus alleviates the effect of word-order differences on SMT. The contributions of this thesis are threefold.

(1) After implementing and analyzing the current constituent-based approach, this thesis proposes a refined approach and augments the current Chinese-English constituent-based reordering rule set.

(2) This thesis proposes a dependency-based reordering approach, the core of which is a novel and systematic framework for creating reordering rules. Based on this framework and the characteristics of Chinese-English, Chinese-Japanese and Japanese-Chinese translation, this work implements dependency-based reordering for SMT systems of these language pairs by creating three reordering rule sets, one per pair. Each of the three is a novel dependency-based rule set for its translation direction.

(3) Since parsing results are the basis of syntax-based reordering, the accuracy of the parser is important. This work conducts a comprehensive comparison of several open-source parsers, which is itself novel. Using a quantitative approach, the comparison evaluates and analyzes the relation between parser accuracy and the performance of the reordering approaches, as well as the effect on the performance of SMT systems. This work fills a gap in this field and recommends parsers for use with reordering approaches.

This work implements the proposed reordering approaches by developing Chinese-English, Chinese-Japanese and Japanese-Chinese SMT systems. The experimental results, obtained on large-scale data sets, demonstrate the effectiveness of these approaches. The Kendall’s τ evaluation and the human evaluation also indicate their effectiveness and accuracy. Moreover, this work proposes a novel reordering evaluation approach based on the number of crossing word alignments.
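The abstract names two reordering measures without defining them. The following is a minimal sketch of how such measures are commonly computed, not the thesis's own implementation. It assumes a word alignment given as (source index, target index) pairs and, for Kendall's τ, exactly one distinct target position per source word. The function names `crossing_count` and `kendall_tau` are hypothetical.

```python
from itertools import combinations


def crossing_count(alignment):
    """Count pairs of alignment links that cross each other.

    alignment: iterable of (source_index, target_index) pairs.
    Two links cross when their source order and target order disagree.
    """
    links = sorted(set(alignment))
    return sum(
        1
        for (s1, t1), (s2, t2) in combinations(links, 2)
        if (s1 - s2) * (t1 - t2) < 0
    )


def kendall_tau(target_positions):
    """Kendall's tau of target positions listed in source-word order.

    Assumes all positions are distinct (no ties). Returns 1.0 when the
    source order already matches the target order, -1.0 when reversed.
    """
    pairs = list(combinations(target_positions, 2))
    if not pairs:
        return 1.0  # zero or one word: trivially monotone
    concordant = sum(1 for a, b in pairs if a < b)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


# Illustrative data only: source word i aligns to target position before[i].
before = [0, 3, 1, 2]
# The same sentence after an ideal pre-reordering of the source words.
after = sorted(before)

print(crossing_count(enumerate(before)), kendall_tau(before))
print(crossing_count(enumerate(after)), kendall_tau(after))
```

Under these assumptions, a successful source-side reordering lowers the crossing count and raises τ toward 1; a monotone alignment gives zero crossings and τ = 1.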
Keywords/Search Tags:statistical machine translation, reordering of source language sentences, reordering rule, constituent parsing, dependency parsing, comparison of parsers