Data analysis and selection for statistical machine translation

Posted on:2017-11-27

Degree:Ph.D

Type:Dissertation

University:Michigan State University

Candidate:Eetemadi, Sauleh

GTID:1445390005965007

Subject:Computer Science

Abstract/Summary:

Statistical Machine Translation has received significant attention from the academic community over the past decade, which has led to significant improvements in machine translation quality. As a result, it is widely adopted in industry (Google, Microsoft, Twitter, Facebook, etc.) as well as in government (http://nist.gov). The biggest factor in this improvement has been the availability of ever-increasing sources of training data as digital multilingual communication and information dissemination become ubiquitous. Relatively little research has been done on training data analysis and selection, despite training data being the main contributor to machine translation quality.

In this work, we first examine fundamental properties of translated and authored text. We introduce a new linguistically motivated feature (Part of Speech Tag Minimal Translation Units) that outperforms prior work in sentence-level translation direction detection. Next, we develop a cross-domain data matrix that enables comparison between different features in the translation direction detection task. We extend our previously introduced feature for translation direction detection to use statistically trained Brown clusters instead of part-of-speech tags. This new feature outperforms all prior work in all cross-domain data matrix combinations.

Data selection in machine translation is performed in different scenarios with different objectives, including reducing training resource consumption, domain adaptation, improving quality, and reducing deployment size. We develop an efficient framework for training data selection and compression called Vocabulary Saturation Filter (VSF), whose computational complexity and memory consumption are linear in the training data size. In our experiments, we show that a machine translation system trained on data selected using VSF is comparable to systems trained with prior data selection methods that have quadratic computational complexity. However, VSF is sensitive to data order. Therefore, we experiment with different orderings of the data and compare the results.

Finally, we develop a highly scalable and flexible data selection framework in which arbitrary sentence-level features can be used for data selection. In addition, a variable threshold function can be used to incorporate any scoring function that is constant throughout the selection process. After introducing this framework, and inspired by the features we introduced for detecting translation direction, we use joint models of source and target based on Minimal Translation Units (MTU), together with source-side context based on Brown clusters, to compare various features and threshold functions within this framework. We run end-to-end experiments using data selected by various methods and compare the resulting statistical translation models using various test sets and phrase table comparison metrics.
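
The abstract does not spell out the VSF selection criterion, so the following is only a minimal sketch of a single-pass vocabulary-saturation style filter under stated assumptions: a sentence pair is kept if at least one of its source-side words has been seen fewer than `threshold` times among the pairs kept so far. The function name, the whitespace tokenization, the use of source-side unigrams, and the `threshold` parameter are all illustrative assumptions, not the dissertation's implementation. The sketch does show the two properties the abstract names: one pass over the data with one counter per vocabulary item, and a dependence on data order.

```python
from collections import Counter
from typing import Iterable, List, Tuple

SentencePair = Tuple[str, str]


def vocabulary_saturation_filter(
    pairs: Iterable[SentencePair], threshold: int = 1
) -> List[SentencePair]:
    """Hypothetical single-pass filter (assumed criterion, see lead-in).

    Keeps a pair if any source-side token has been counted fewer than
    `threshold` times in the pairs kept so far.
    """
    counts: Counter = Counter()
    kept: List[SentencePair] = []
    for source, target in pairs:
        tokens = set(source.split())  # assumption: whitespace tokenization
        # Keep the pair only if some token is not yet "saturated".
        if any(counts[token] < threshold for token in tokens):
            kept.append((source, target))
            # Count each distinct token once per kept sentence.
            counts.update(tokens)
    return kept


# The same three pairs in two different orders give different selections.
long_first = [("a b", "x y"), ("a", "x"), ("b", "y")]
short_first = [("b", "y"), ("a", "x"), ("a b", "x y")]

selected_long_first = vocabulary_saturation_filter(long_first, threshold=1)
selected_short_first = vocabulary_saturation_filter(short_first, threshold=1)
```

With the longer sentence first, it saturates both words and the two shorter sentences are dropped; with the shorter sentences first, both are kept and the longer one is dropped. This is one simple way in which a filter of this kind can be sensitive to data order.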

Keywords/Search Tags:

Translation, Data, Selection, Using

Related items

1	The E-C Translation Of The Big Data Agenda: Data Ethics And Critical Data Studies (Chapter 1-2) And A Report On The Translation
2	A Report On The Translation Of Data Science For Business-What You Need To Know About Data Mining And Data-Analytic Thinking(Chapter Fourteen And Appendixes A&B) By Foster Provost And Tom Fawcett
3	A Report On E-C Translation Of Big Data In Practice: How 45 Successful Companies Used Big Data Analytics To Deliver Extraordinary Results (Excerpts)
4	A Project Report On Translation Of Big Data In Practice: How 45 Successful Companies Used Big Data Analytics To Deliver Extraordinary Results (Chapters From 3-9)
5	A Report On The Translation Of Data Science For Business-What You Need To Know About Data Mining And Data-analytic Thinking(Chapter 3)
6	A Translation Project Report Of Data Science For Business -What You Need To Know About Data Mining And Data-analytic Thinking(Chapter One) By Foster Provost And Tom Fawcett
7	An E-C Report On An Introduction To Data-Everything You Need To Know About AI,Big Data And Data Science (Excerpt)
8	A Project Report On The Translation Of Data Science For Business-what You Need To Know About Data Mining And Data-analytic Thinking(Chapter 12)
9	On Goldblatt’s Translating Of Shengsi Pilao—— From The Perspective Of The Approach To Translation As Adaptation And Selection
10	The English-chinese Translation Of Critical Analysis Of Big Data Challenges And Analytical Methods And A Report On The Translation