
Research On Domain Adaptation For Statistical Machine Translation Based On Topic And Semantic Analysis

Posted on: 2019-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Liu
Full Text: PDF
GTID: 2405330545951190
Subject: Management Science and Engineering
Abstract/Summary:
Statistical Machine Translation (SMT) relies heavily on large-scale parallel corpora and builds statistical models using substantial computing power and machine learning algorithms. However, the performance of an SMT system decreases when it translates domain-specific texts. The reason is that the training data mixes information from several domains, so the translation model learns a blend of translation knowledge and linguistic phenomena and cannot adapt to the semantics and language style of a specific domain. Research on domain adaptation for SMT aims to establish methods for dynamically adjusting the translation model so that it can learn and process the language features of the target domain, ensuring balanced and reliable translation quality across domains. This thesis focuses on domain adaptation for SMT and covers the following contents:

(1) Domain-relevant data selection based on topic information. We propose a data selection method based on topic information that extracts domain-relevant sentence pairs from a large-scale general-domain corpus in order to expand the domain-specific training data and improve the performance of the translation system. We use a bilingual topic model to represent sentence pairs as topic distributions and construct mappings between topics and the target domain. This method introduces underlying semantic information from the topic perspective to better estimate the domain relevance of sentence pairs. Experimental results show that our method improves translation performance by nearly 1.64 BLEU points.

(2) Reordering model adaptation based on a topic model. We show that phrase reordering distributions differ significantly across domains and propose a domain-adaptive reordering model that incorporates topic information. It addresses the problem of adapting dynamically when the domain of the test set is unknown. Specifically, we analyze the topic information of the corpus and obtain the reordering distribution of phrases under different topics. When decoding, we infer the topic distribution of the test set and use it to weight the per-topic reordering distributions, which optimizes the reordering distribution of phrase pairs and enhances the performance of the cross-domain SMT system. Experimental results show that the reordering model adaptation method improves performance by 0.76 BLEU points.

(3) Terminology translation error identification and correction. We propose a post-processing method for the translation system to address the poor quality of domain terminology translation. We use a back-translation strategy and convert the identification of terminology translation errors into a quality evaluation of the back-translated text. We use three metrics: the language model perplexity of the back-translated text, tree-edit distance, and sentence semantic similarity. Experimental results show that our method can effectively identify and correct terminology translation errors and improves performance on both weak and strong SMT systems, yielding precision gains of 0.48% and 1.51% respectively.
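The topic-weighted reordering step in (2) can be illustrated with a small sketch. This is not the thesis implementation; it assumes a mixture of the form p(o | pair) = sum_k p(k | test) * p(o | pair, k). The orientation classes (monotone, swap, discontinuous), the function name `mix_reordering`, and all numbers are hypothetical and used only for illustration.

```python
# Hypothetical orientation classes, as in common lexicalized reordering models.
ORIENTATIONS = ("monotone", "swap", "discontinuous")


def mix_reordering(topic_weights, per_topic_dist):
    """Weight per-topic reordering distributions by the test-set topic distribution.

    topic_weights:  {topic_id: p(topic | test set)}
    per_topic_dist: {topic_id: {orientation: p(orientation | phrase pair, topic)}}
    Returns one distribution over ORIENTATIONS for the phrase pair.
    """
    mixed = {o: 0.0 for o in ORIENTATIONS}
    for topic, weight in topic_weights.items():
        dist = per_topic_dist.get(topic)
        if dist is None:
            # The phrase pair was never observed under this topic; skip it.
            continue
        for o in ORIENTATIONS:
            mixed[o] += weight * dist.get(o, 0.0)

    total = sum(mixed.values())
    if total == 0.0:
        # No usable topic evidence: fall back to a uniform distribution.
        return {o: 1.0 / len(ORIENTATIONS) for o in ORIENTATIONS}
    # Renormalize so that skipped topics do not leave probability mass missing.
    return {o: value / total for o, value in mixed.items()}


# Illustrative values only: two topics, one phrase pair.
test_topics = {0: 0.75, 1: 0.25}
phrase_pair_dists = {
    0: {"monotone": 0.8, "swap": 0.1, "discontinuous": 0.1},
    1: {"monotone": 0.4, "swap": 0.4, "discontinuous": 0.2},
}
adapted = mix_reordering(test_topics, phrase_pair_dists)
```

In this sketch, a test set dominated by topic 0 pulls the adapted distribution toward that topic's mostly monotone behavior, while topic 1 still contributes in proportion to its weight.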
Keywords/Search Tags:Statistical Machine Translation, Domain Adaptation, Topic Information, Translation Model Optimization, Terminology Translation