Data analysis and selection for statistical machine translation

Posted on: 2017-11-27
Degree: Ph.D.
Type: Dissertation
University: Michigan State University
Candidate: Eetemadi, Sauleh
Full Text: PDF
GTID: 1445390005965007
Subject: Computer Science
Abstract/Summary:
Statistical machine translation has received significant attention from the academic community over the past decade, which has led to significant improvements in translation quality. As a result, it is widely adopted in industry (Google, Microsoft, Twitter, Facebook, etc.) as well as in government (http://nist.gov). The biggest factor in this improvement has been the availability of ever-increasing sources of training data as digital multilingual communication and information dissemination become ubiquitous. Relatively little research has been done on training data analysis and selection, despite training data being the main contributor to machine translation quality.

In this work, we first examine fundamental properties of translated and authored text. We introduce a new linguistically motivated feature (Part-of-Speech Tag Minimal Translation Units) that outperforms prior work in sentence-level translation direction detection. Next, we develop a cross-domain data matrix that enables comparison between different features in the translation direction detection task. We extend our previously introduced feature for translation direction detection to use statistically trained Brown clusters instead of part-of-speech tags. This new feature outperforms all prior work in all cross-domain data matrix combinations.

Data selection in machine translation is performed in different scenarios with different objectives, including reducing training resource consumption, domain adaptation, improving quality, and reducing deployment size. We develop an efficient framework for training data selection and compression called the Vocabulary Saturation Filter (VSF); its computational complexity and memory consumption are linear in the training data size. In our experiments, we show that a machine translation system trained on data selected using VSF is comparable to systems trained with prior data selection methods that have quadratic computational complexity. However, VSF is sensitive to data order. Therefore, we experiment with different orderings of the data and compare the results.

Finally, we develop a highly scalable and flexible data selection framework in which arbitrary sentence-level features can be used for data selection. In addition, a variable threshold function can be used to incorporate any scoring function that is constant throughout the selection process. After introducing this framework, and inspired by the features we introduced for detecting translation direction, we use joint models of source and target based on Minimal Translation Units (MTUs), together with source-side context based on Brown clusters, to compare various features and threshold functions within the framework. We run end-to-end experiments using data selected by the various methods and compare the resulting statistical translation models using several test sets and phrase table comparison metrics.
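The abstract does not spell out how the Vocabulary Saturation Filter works, so the following is only a minimal sketch under stated assumptions, meant to illustrate why a single-pass filter of this kind can be linear in the data size and sensitive to data order. It assumes that a sentence pair is kept when at least one of its source-side words has been seen fewer than `threshold` times among the pairs kept so far. The function name, the `threshold` parameter, the whitespace tokenization, and the use of single words rather than longer n-grams are all hypothetical choices, not details taken from the dissertation.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def vocabulary_saturation_filter(
    pairs: Iterable[Tuple[str, str]],
    threshold: int = 1,
) -> List[Tuple[str, str]]:
    """Single-pass selection sketch (assumed behavior, not the author's code).

    A pair is kept if any source-side word has been counted fewer than
    `threshold` times in the pairs selected so far.
    """
    counts: Counter = Counter()  # word -> occurrences in selected pairs
    selected: List[Tuple[str, str]] = []
    for source, target in pairs:
        words = source.split()  # assumption: whitespace tokenization
        if any(counts[w] < threshold for w in words):
            selected.append((source, target))
            counts.update(words)  # only kept pairs contribute to the counts
    return selected


corpus = [
    ("the cat sat", "le chat"),
    ("the cat sat", "duplicate"),
    ("the dog sat", "le chien"),
    ("the cat", "short"),
]

forward = vocabulary_saturation_filter(corpus)
backward = vocabulary_saturation_filter(list(reversed(corpus)))
print(forward)
print(backward)
```

Each token is looked up and updated a constant number of times, which is where the linear time and memory behavior comes from in this sketch. Running the filter on the reversed corpus selects a different subset, which illustrates the order sensitivity mentioned in the abstract.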
Keywords/Search Tags:Translation, Data, Selection, Using