
A Computational Research On The Unit Of Translation For Automatic Bitext Alignment

Posted on: 2017-02-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K X Li
GTID: 1485305102990149
Subject: Foreign Linguistics and Applied Linguistics
Abstract/Summary:
Bitext alignment is an essential part of parallel corpus processing. The lack, or even the complete absence, of linguistic knowledge in automatic bitext alignment has severely reduced the linguistic value of the alignment results. Studies of the unit of translation have focused on describing the process by which translators choose suitable source text units and transfer them into target text equivalents. Therefore, if a bitext alignment model is built on the unit of translation, it will enable the computer to simulate the mode of thinking of human translators, thus offering better guidance for automatic bitext alignment tasks. This dissertation first analyzed the major definitions of the unit of translation both qualitatively and quantitatively, revealing an array of defining features of the unit of translation. It then expounded, from the perspective of computational linguistics, on how these features could be computed mathematically and integrated into a statistical model of bitext alignment so as to automatically extract units of translation and their equivalents from parallel corpora. The results of the alignment model were further investigated in order to identify regularities of the unit of translation as well as their implications for machine translation, especially example-based machine translation.

The study provided an overview of contemporary research on the unit of translation, pointing out that the term, albeit frequently used, remains ambiguous and that investigators have not reached a consensus on its defining features. A comparison of the heterogeneous definitions of the unit of translation yielded four defining features (or attributes), namely, compactness, independence, lack of ambiguity and lack of correspondence. These attributes are used to analyze, from different angles, whether a linguistic unit in the source text may properly be deemed a unit of translation: whether the unit is composed of elements so closely knit together that it should be translated as a whole, whether it can be rendered in isolation from its context, whether it is monosemous, and whether all of its components can be mapped onto target text units. The underlying difference between the various definitions of the unit of translation is that each of them emphasizes only one of these defining features.

Meanwhile, a quantitative study was conducted in which the units of translation in a parallel corpus of 491 sentence pairs, extracted from the NIST 2002 test bed for the machine translation evaluation program, were manually marked in accordance with the four features. Statistical observation of the annotated corpus shows that a linguistic unit in the source text may or may not be a unit of translation depending on the defining feature adopted. Specifically, the criteria of compactness, independence and lack of ambiguity tend to demarcate a word together with its context as a unit of translation, whereas a unit of translation under the criterion of lack of correspondence is usually a single word or a phrase.

A bitext alignment model based on the unit of translation was then formulated that takes the four attributes into account. A working definition of the unit of translation was given first, followed by a discussion of how these attributes can be computed with the statistical values and linguistic resources available. The major computational methods are the GIZA++ word alignment models, which are used to extract word-level correspondences between the source text and the target text; MIPD (short for "mutual information potential difference"), a newly devised measure that integrates compactness with independence for source text unit analysis; and the Vector Space Model, which assesses the semantic distance between the target text units that are equivalents of the same source text unit for ambiguity analysis. The linguistic data employed consist of a large parallel corpus along with the Google Web 1T 5-gram corpus, both of which are intended to reduce data sparseness.

The mechanism of the proposed unit-of-translation-based bitext alignment model is as follows. First, the source text and the target text are each annotated with part-of-speech taggers and parsers. The texts are then aligned at the word level with the GIZA++ statistical alignment toolkit, and the resulting word alignments are used as anchors to extract possible source text units and their target text equivalents at other sub-sentential levels. All of these corresponding units are further analyzed in accordance with the four defining features, namely, compactness, independence, lack of ambiguity and lack of correspondence. A source text unit is deemed a unit of translation only when its feature values reach the specified statistical thresholds. Overall, the present model can dynamically align the source text and the target text at the level of the unit of translation.

The experiment yielded some findings about the unit of translation. Specifically, units of translation are in fact a series of source text units which have to be rendered as a whole because of their compactness, their independence, their monosemy, or the correspondence between them and the mapped target texts. In other words, whether a linguistic unit is a unit of translation depends on multiple factors, such as the formal and semantic features of the source text and the correspondence between the source text and the target text. The size of the unit of translation changes with the angle from which the source and target texts are analyzed. Therefore, the notion of the unit of translation is, in essence, a dynamic one.

The bitext alignment model formulated on the basis of the unit of translation not only offers a new vantage point for research on the unit of translation, but also introduces rich linguistic knowledge into the field of bitext alignment, and is thus conducive to improving the performance of corpus-based machine translation systems. The proposed alignment model outperforms the alignment results of the GIZA++ toolkit in both quantity and quality, in that it adds new alignment links and deletes or cross-checks erroneous links in the baseline word alignment results. In addition, the result of bitext alignment based on the unit of translation is a valuable source of linguistic data for machine-aided translation. Part of the aligned texts may be used to build translation memories or multilingual term banks, so that translators can refer to them in their work and thus improve the quality and efficiency of translation.

The weakness of the present study is that, as a tentative attempt to apply the unit of translation to automatic bitext alignment, it does not thoroughly investigate the possible statistical methods and algorithms for computing the unit of translation. The study depends heavily on the word alignment of the GIZA++ toolkit, whose results are still far from satisfactory. In addition, alternative natural language processing resources such as WordNet, which in theory may also contribute to the computation of the unit of translation, have been neglected in this research.

Finally, the dissertation points out several directions for further investigation. It suggests that exhaustive experiments be conducted to test the effect of the proposed alignment model on the statistical machine translation system as a whole, and that the unit of translation be further discussed quantitatively using the automatically aligned corpora.
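The thresholding step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for MIPD, so the code below substitutes plain pointwise mutual information (PMI) over adjacent words as a stand-in for compactness, and uses cosine similarity between vectors of target equivalents as one possible reading of the Vector Space Model check for ambiguity. The function names, toy data, and threshold values are all hypothetical and are not taken from the dissertation.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens, left, right):
    """PMI (log base 2) of the adjacent pair (left, right) in a token list.

    A stand-in for compactness; this is NOT the dissertation's MIPD measure.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(left, right)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_pair = pair_count / (len(tokens) - 1)
    p_left = unigrams[left] / len(tokens)
    p_right = unigrams[right] / len(tokens)
    return math.log2(p_pair / (p_left * p_right))


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_candidate_unit(compactness, equivalent_similarity,
                      min_compactness=1.0, min_similarity=0.5):
    """Accept a source unit only if both scores reach their thresholds.

    Both thresholds are hypothetical illustration values.
    """
    return (compactness >= min_compactness
            and equivalent_similarity >= min_similarity)


# Toy source-side corpus: "new york" always occurs together.
tokens = "new york is big and new york is old".split()
compactness = pointwise_mutual_information(tokens, "new", "york")

# Toy context vectors for two target equivalents of the same source unit;
# identical vectors mean the equivalents are semantically close.
similarity = cosine_similarity([1.0, 0.0], [1.0, 0.0])

print(is_candidate_unit(compactness, similarity))
```

In this sketch a high similarity among the target equivalents is read as low ambiguity, so a unit passes only when it is both tightly bound on the source side and consistently translated on the target side.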
Keywords/Search Tags: Unit of Translation, Parallel Corpus, Bitext Alignment, Statistical Machine Translation, Machine-Aided Translation