
Research On Statistical Machine Translation At Document Level

Posted on: 2015-03-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z X Gong
GTID: 1268330428998160
Subject: Computer application technology
Abstract/Summary:
Machine translation is an active research topic in Natural Language Understanding. It can effectively promote information sharing and thus has wide application and research value. Statistical Machine Translation (SMT) has been the mainstream machine translation technology in recent years. However, most SMT systems translate documents sentence by sentence under strict independence assumptions. They therefore use only limited sentence-level context and completely ignore the relationships between sentences and the global information of the text. Yet the characteristics of a text, such as style, subject and genre, can help to disambiguate word senses, keep the language style consistent, and above all convey the key information of the original text during translation.

The idea of performing machine translation over discourse units was put forward as early as 1992; however, most machine translation systems still work at the level of isolated sentences. The reasons are manifold, such as the lack of document information in parallel corpora. The slow progress of this research shows that it is a tremendously challenging task. The main content of this dissertation includes:

1. Research on designing reliable frameworks for document-level SMT.
In order to simulate the human translation process closely, we first present a cache-based document-level SMT system. The caches fall into three categories, which describe the background, the topic and the lexical cohesion of the text respectively. Three kinds of features for the SMT log-linear model are designed to use the information in these caches. The proposed framework can guide traditional SMT systems to use document-level knowledge effectively. The second framework is based on the N-best list produced by the SMT system, so we call it a post-processing procedure. Its aim is to keep the topic models of the source-side and target-side texts consistent. Inspired by extractive summarization, this system generates the final hypothesis collection by dynamically selecting translation hypotheses from the N-best list under the assumption of topic-model consistency. Both frameworks successfully integrate document-level knowledge into SMT systems, and according to the experimental results the former achieves the more significant improvements.

2. Research on tense models for document-level SMT.
Tense research is an effective extension of the knowledge available to document-level SMT. The tense model runs on our cache-based SMT system and can integrate rich contextual knowledge. Based on the temporal continuity within a document, this dissertation puts forward an N-gram-based tense model, which reflects tense variation both between sentences and within sentences. It further proposes a classifier-based tense model, which generalizes better. Experiments show that combining SMT with the tense model effectively improves translation quality, and the best SMT system is improved by 0.97 percent in BLEU score.

3. Research on automatic evaluation metrics for document-level SMT.
Translation results should reflect the main content of the original text, so we first propose a topic-sentence-driven evaluation metric and a topic-model-based evaluation metric. Second, document-level translation should preserve lexical cohesion, and so an evaluation metric based on lexical chains is proposed. Experimental results show that the proposed evaluation metrics improve the Spearman correlation with human assessments.

This dissertation covers the core issues of document-level SMT comprehensively. Related research, both in China and abroad, is currently still in its infancy. The work is innovative within SMT and offers valuable reference for future research on document-level SMT.
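As an illustration of the N-gram-based tense model mentioned in the abstract, the following is a minimal sketch of a bigram model over per-sentence tense labels, assuming each sentence has already been tagged with one tense. It is not the dissertation's implementation: the tag set `TENSES`, the function names, the start symbol `<s>` and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical tense tag set; the dissertation's actual inventory is not given.
TENSES = ("past", "present", "future")
START = "<s>"  # assumed start-of-document symbol


def train_bigram_tense_model(documents):
    """Count tense bigrams over documents.

    `documents` is a list of tense sequences, one label per sentence.
    Returns (bigram_counts, context_counts).
    """
    bigrams = Counter()
    contexts = Counter()
    for seq in documents:
        prev = START
        for tense in seq:
            bigrams[(prev, tense)] += 1
            contexts[prev] += 1
            prev = tense
    return bigrams, contexts


def tense_log_prob(seq, model):
    """Log-probability of a tense sequence under the bigram model,
    using add-one smoothing over TENSES (an assumption of this sketch)."""
    bigrams, contexts = model
    prev = START
    total = 0.0
    for tense in seq:
        numerator = bigrams[(prev, tense)] + 1
        denominator = contexts[prev] + len(TENSES)
        total += math.log(numerator / denominator)
        prev = tense
    return total


# Toy training data: two documents with consistent tense.
model = train_bigram_tense_model([
    ["past", "past", "past"],
    ["present", "present"],
])

# A tense-consistent continuation scores higher than an abrupt shift.
consistent = tense_log_prob(["past", "past"], model)
shifted = tense_log_prob(["past", "future"], model)
```

In a log-linear SMT model, a score such as `tense_log_prob` could serve as one additional feature alongside the usual translation and language model features; how the dissertation actually integrates it is not specified in the abstract.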
Keywords/Search Tags: Statistical Machine Translation, Document-level SMT, Cache-based Technology, Tense Model, Automatic Evaluation for MT