Research On Pivot-based Statistical Machine Translation

Posted on:2020-04-11

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X N Zhu

GTID:1368330590472772

Subject:Computer application technology

Abstract/Summary:

Machine translation refers to the use of computers to translate from one natural language into another. In recent years, statistical machine translation (SMT) and neural machine translation have become the mainstream of machine translation research. The basic idea of both is to use learning algorithms to learn translation rules from large-scale bilingual corpora. However, large-scale bilingual corpora are not always available for some language pairs. To alleviate this data scarcity, the pivot language approach has been proposed, in which a pivot language serves as a "bridge" connecting the source and target languages. The premise of the pivot approach is that a large amount of source-pivot and pivot-target parallel data is available. This dissertation focuses on the following aspects of pivot-based statistical machine translation.

1. The probability estimation of pivot-based translation. The traditional pivot-based approach builds on phrase-based statistical machine translation and constructs a source-target phrase table by merging the source-pivot and pivot-target phrase tables. One of the key issues in this method is estimating the translation probabilities of the generated source-target phrase pairs. Conventionally, the probabilities are estimated by multiplying the posterior probabilities of the source-pivot and pivot-target phrase pairs. However, it has been shown that the resulting probabilities are not accurate enough. One possible reason is the non-uniformity of the probability space. To solve this problem, we propose a novel approach that uses the co-occurrence counts of source-target phrase pairs to estimate phrase translation probabilities more precisely. Unlike the triangulation method, which merges the source-pivot and pivot-target phrase pairs after the translation models have been trained, we merge the source-pivot and pivot-target phrase pairs immediately after the phrase extraction step and estimate the co-occurrence counts of the source-pivot-target phrase pairs. Finally, we compute the translation probabilities from the estimated co-occurrence counts, using the standard training method of phrase-based SMT. Experimental results on Europarl data and web data show that our method leads to significant improvements over the baseline systems.

2. Hidden translation rule mining in pivot-based machine translation. One weakness of current pivot-based machine translation is that some corresponding source and target phrase pairs cannot be generated because they are connected to different pivot phrases. To solve this problem, we apply a Markov random walk method to the pivot-based SMT system to discover potential translations between the source and target languages via the pivot language. Experimental results on the Europarl corpus and web data show that our method leads to significant improvements over the baseline systems.

3. Noise and model pruning in pivot-based machine translation. Noise in the phrase table is a key problem in SMT. It has many causes, including 1) noise in the parallel corpora and 2) defects of the learning algorithm. Because the pivot-based phrase table is generated by combining two standard SMT phrase tables, the noise in a standard phrase table may be transferred to and amplified in the pivot-based phrase table. In addition, because of ambiguities in the pivot language, source and target phrases with different meanings may be wrongly matched. Consequently, the derived source-target phrase table may contain incorrect phrase pairs. To alleviate this problem, we apply the minimum Bayes-risk method to prune the phrase table. The minimum Bayes-risk pruning method removes the phrase pairs with the lowest risk from the phrase table. Experimental results on Europarl data show that the proposed method both reduces the size of the phrase tables and improves translation performance.

4. The lexicalized reordering model in pivot-based machine translation. Phrase reordering is a very important issue in phrase-based statistical machine translation. In recent years, various phrase reordering methods have been proposed to help SMT systems generate fluent translations. Among them, the lexicalized reordering model is commonly used in current SMT systems. For each phrase pair, the lexicalized reordering model defines three types of orientation: directly following the previous phrase (monotone), swapped with the previous phrase (swap), or not connected to the previous phrase (discontinuous). When the lexicalized reordering model is applied to the triangulation method, a key problem is that the context information is missing from the phrase table. We present a context-extended phrase reordering model for pivot-based statistical machine translation that extends the context information in the source, pivot and target languages. Experimental results show that our method leads to significant improvements over the baseline system.
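The contrast drawn in the first contribution, between multiplying probabilities and pivoting counts, can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The function names, the nested-dictionary table layout, and the toy data are all hypothetical. In particular, combining the two counts with `min` is only an assumed placeholder, since the abstract does not say how the co-occurrence count is estimated.

```python
from collections import defaultdict


def triangulate(sp_probs, pt_probs):
    """Conventional triangulation: p(t|s) = sum over pivots p of p(p|s) * p(t|p).

    sp_probs maps source -> {pivot: p(pivot|source)}.
    pt_probs maps pivot -> {target: p(target|pivot)}.
    """
    table = defaultdict(float)
    for s, pivots in sp_probs.items():
        for p, prob_sp in pivots.items():
            for t, prob_pt in pt_probs.get(p, {}).items():
                table[(s, t)] += prob_sp * prob_pt
    return dict(table)


def pivot_counts(sp_counts, pt_counts, combine=min):
    """Merge extracted phrase pairs through shared pivots and estimate counts.

    `combine` is an ASSUMED way to turn c(s,p) and c(p,t) into a
    source-pivot-target count; the abstract does not specify one.
    """
    counts = defaultdict(float)
    for s, pivots in sp_counts.items():
        for p, c_sp in pivots.items():
            for t, c_pt in pt_counts.get(p, {}).items():
                counts[(s, t)] += combine(c_sp, c_pt)
    return dict(counts)


def relative_frequency(counts):
    """Standard relative-frequency estimate: p(t|s) = c(s,t) / sum_t' c(s,t')."""
    totals = defaultdict(float)
    for (s, _), c in counts.items():
        totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}


# Toy French -> English (pivot) -> German data, invented for illustration.
sp_probs = {"maison": {"house": 0.8, "home": 0.2}}
pt_probs = {"house": {"Haus": 0.9, "Bau": 0.1},
            "home": {"Haus": 0.5, "Heim": 0.5}}
by_product = triangulate(sp_probs, pt_probs)

sp_counts = {"maison": {"house": 8, "home": 2}}
pt_counts = {"house": {"Haus": 9, "Bau": 1},
             "home": {"Haus": 5, "Heim": 5}}
by_count = relative_frequency(pivot_counts(sp_counts, pt_counts))
```

In this toy case the product-based estimate gives p(Haus|maison) = 0.8 * 0.9 + 0.2 * 0.5 = 0.82, while the count-based estimate under the assumed `min` combination gives 10/13. The two routes can therefore assign different probabilities to the same induced phrase pair.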

Keywords/Search Tags:

statistical machine translation, pivot language, phrase table, random walk, reordering, model pruning

Related items

1	On Key Technologies For Pivot-Based Statistical Machine Translation
2	Research On Chinese-Vietnamese Phrase Machine Translation Method Based On Pivot Language
3	A Study On Reordering Issues Of Phrase-Based Statistical Machine Translation
4	Research On Phrase-based Statistical Machine Translation
5	Research On The Key Technologies For Phrase-based Statistical Machine Translation Models
6	Statistical Machine Translation Research And Applications
7	Translation Knowledge Acquisition In Corpus-based Machine Translation
8	Reordering Of Source Language Sentences For Statistical Machine Translation
9	The Study On Phrase-Based Statistical Machine Translation System
10	Pivot-based Statistical Machine Translation for Morphologically Rich Languages