
Topically-Informed Bilingually-Constrained Recursive Autoencoders For Statistical Machine Translation

Posted on: 2019-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Ruan
Full Text: PDF
GTID: 2415330548486867
Subject: Computer Science and Technology
Abstract/Summary:
Learning high-quality phrase vector representations is one of the important research topics in statistical machine translation (SMT). Inspired by the success of monolingual phrase embeddings, many approaches have been proposed to learn bilingual phrase embeddings for SMT. The intuition behind them is that a phrase and its correct translation should share the same semantic meaning, and thus they should be embedded close to each other in a shared embedding space. However, existing work considers only the semantic composition of the words within a phrase, while semantic information beyond the phrase is ignored, which limits the potential of the learned phrase vector representations.

In this paper, we propose Topically-informed Bilingually-constrained Recursive Autoencoders (TBRAE), which address this issue by learning contextual bilingual phrase embeddings. By incorporating contextual clues into phrase vector representations, our model substantially extends the Bilingually-constrained Recursive Autoencoders (BRAE), exploiting latent topics in two ways. Our first inspiration comes from the observation that the meanings of phrases are often context-dependent. Hence, we represent the document-level context of each phrase by its document-topic distribution, which is combined with the RAE to produce a topical phrase embedding. Our second inspiration derives from the observation that the word-topic assignments produced by a latent topic model reflect the semantic correlations between words and topics, which can be used to constrain the learning of their embeddings. To this end, we design word-topic semantic constraints that encourage words with similar topic assignments to be placed close together in the embedding space.

Compared with BRAE, the TBRAE model not only considers the document-level context beyond the phrase, but also directly models the interactions between word and topic embeddings, both of which are the bases of topical phrase embeddings. Experimental results on Chinese-English translation show that the proposed model significantly improves translation quality on the NIST test sets.
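The two ideas above can be sketched in a few lines of code. This is an illustrative toy, not the thesis implementation: the function names, the greedy left-to-right composition (BRAE learns a binary tree), the concatenation-based fusion of phrase vector and topic distribution, and the pairwise squared-distance form of the word-topic constraint are all assumptions made for the sketch.

```python
import numpy as np

def compose(left, right, W, b):
    # RAE composition step: parent = tanh(W @ [left; right] + b)
    return np.tanh(W @ np.concatenate([left, right]) + b)

def phrase_embedding(word_vecs, W, b):
    # Compose word vectors into a phrase vector, left to right.
    # (BRAE uses a learned binary tree; greedy order is a simplification.)
    h = word_vecs[0]
    for v in word_vecs[1:]:
        h = compose(h, v, W, b)
    return h

def topical_phrase_embedding(word_vecs, doc_topics, W, b, Wt, bt):
    # Fuse the RAE phrase vector with the document-topic distribution
    # (the document-level context), giving a topical phrase embedding.
    h = phrase_embedding(word_vecs, W, b)
    return np.tanh(Wt @ np.concatenate([h, doc_topics]) + bt)

def word_topic_constraint(word_vecs, topic_of):
    # Word-topic semantic constraint: sum of squared distances over
    # pairs of words assigned to the same latent topic, so that words
    # with similar topic assignments are pulled close together.
    loss = 0.0
    for i in range(len(word_vecs)):
        for j in range(i + 1, len(word_vecs)):
            if topic_of[i] == topic_of[j]:
                loss += float(np.sum((word_vecs[i] - word_vecs[j]) ** 2))
    return loss

# Tiny demo with random parameters: d = 4 embedding dims, K = 3 topics.
rng = np.random.default_rng(0)
d, K = 4, 3
W, b = rng.normal(size=(d, 2 * d)), np.zeros(d)
Wt, bt = rng.normal(size=(d, d + K)), np.zeros(d)
words = [rng.normal(size=d) for _ in range(3)]
doc_topics = np.array([0.6, 0.3, 0.1])  # e.g. an LDA posterior for the document
vec = topical_phrase_embedding(words, doc_topics, W, b, Wt, bt)
```

In training, a loss of this shape would be added to the BRAE reconstruction and translation-equivalence objectives; here it only illustrates where the topic model's two outputs (document-topic distribution and per-word topic assignments) enter the model.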
Keywords/Search Tags: Topical Phrase Embeddings, Bilingually-Constrained Recursive Autoencoders, Statistical Machine Translation