
Research On Quality Estimation Based On Pre-trained Language Models

Posted on: 2022-07-27
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Meng
Full Text: PDF
GTID: 2518306572451014
Subject: Software engineering
Abstract/Summary:
With the rapid development of computer science, machine translation, in which computers rather than humans translate between languages, has gradually become mainstream. Because machine translation is not yet mature, automatically evaluating its output helps people select the most suitable translation. Machine translation quality estimation (QE) is an automatic evaluation method that requires no reference translation: it assesses quality using only the source sentence and the machine translation, so it is widely used when no reference is available. Despite these advantages, QE training data depends on professional translators to produce manual post-edits of machine translations, so QE datasets are generally small and data is scarce. In recent years, pre-trained language models have achieved remarkable results across natural language processing tasks by learning general-purpose representations from large amounts of unlabeled data and transferring them to downstream tasks. This thesis therefore introduces pre-trained language models into the QE task and studies QE based on them from three aspects: model design, strategy optimization, and data augmentation. The main research contents and contributions are as follows.

First, this thesis proposes a quality estimation model based on pre-trained language models with bidirectional semantic representations. An analysis of existing models built on the "Predictor-Estimator" structure reveals two problems: the predictor's representations are inadequate, and there is a mismatch between the training stages of the predictor and the estimator. The proposed model replaces the original predictor, which can only produce unidirectional representations, with a bidirectional pre-trained language model, and fine-tunes that model so that its parameters are updated jointly with the estimator during the quality estimation stage. The model is also general and works with most monolingual pre-trained language models. In the experiments, several mainstream pre-trained language models are compared. The ELECTRA-based quality estimation model achieves the best results and surpasses the baseline model on the sentence-level tasks of the WMT2017 EN-DE and CCMT2019 EN-ZH datasets.

Second, this thesis explores replaced token detection strategies suited to the QE task. The default training strategy of a pre-trained language model may not be well optimized for the downstream QE task, so ELECTRA, which performed best in the preceding chapter, is chosen as the object of study. The thesis first describes the relationship between ELECTRA's replaced token detection mechanism and the QE task. It then investigates four aspects: the relative strength of the generator and the discriminator, the token replacement ratio, selective token substitution according to part of speech, and replaced token detection combined with masked language modeling (MLM). By varying these settings and observing their effect on quality estimation results, an optimized replaced token detection strategy is obtained that improves model performance.

Finally, this thesis proposes an ELECTRA-based pseudo-data generation method for quality estimation. The ELECTRA generator replaces some words in a sentence to produce a rewritten sentence, which is combined with the source sentences, the manually post-edited sentences, and the corresponding quality labels to form pseudo data at the sentence level and the word level. For sentence-level pseudo data, pseudo data with different distribution characteristics are constructed from different input sources, and suitable training strategies are identified for each; for word-level pseudo data, more reasonable pseudo data are generated, which addresses the imbalanced label distribution of the training data. Experiments show that training with the pseudo data constructed by the proposed method further improves model performance and achieves the best results on the CCMT2021 EN-ZH and CCMT2021 ZH-EN sentence-level QE tasks.
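The pseudo-data idea in the abstract (replace some tokens of a sentence, then derive word-level and sentence-level labels) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The ELECTRA generator is replaced here by random sampling from a hypothetical candidate list. The `OK`/`BAD` tags, the use of the post-edited sentence as the starting point, the 15% default replacement ratio, and the sentence score defined as the fraction of replaced tokens are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class PseudoExample:
    source: str
    pseudo_mt: List[str]      # rewritten sentence acting as a pseudo translation
    post_edit: List[str]      # original (post-edited) tokens
    word_labels: List[str]    # assumed tag set: "OK" or "BAD"
    sentence_score: float     # assumed proxy: fraction of replaced tokens


def make_pseudo_example(
    source: str,
    post_edit_tokens: Sequence[str],
    candidates: Sequence[str],
    replace_ratio: float = 0.15,
    rng: Optional[random.Random] = None,
) -> PseudoExample:
    """Corrupt a post-edited sentence and label the corrupted positions.

    `candidates` is a hypothetical stand-in for generator predictions;
    a real system would sample replacements from a masked language model.
    """
    rng = rng or random.Random()
    n = len(post_edit_tokens)
    k = max(1, round(n * replace_ratio)) if n else 0
    positions = set(rng.sample(range(n), k))

    pseudo, labels = [], []
    for i, tok in enumerate(post_edit_tokens):
        if i in positions:
            options = [c for c in candidates if c != tok]
            new_tok = rng.choice(options) if options else tok
            pseudo.append(new_tok)
            # A token is only "BAD" if it actually changed.
            labels.append("BAD" if new_tok != tok else "OK")
        else:
            pseudo.append(tok)
            labels.append("OK")

    score = labels.count("BAD") / n if n else 0.0
    return PseudoExample(source, pseudo, list(post_edit_tokens), labels, score)


if __name__ == "__main__":
    pe = "the cat sat on the old red mat".split()
    ex = make_pseudo_example("(source sentence)", pe, ["dog", "ran", "blue"],
                             replace_ratio=0.25, rng=random.Random(0))
    print(ex.pseudo_mt, ex.word_labels, ex.sentence_score)
```

Raising `replace_ratio` in such a sketch shifts the label distribution toward `BAD`, which is one simple way to see how generated data could counteract an imbalanced word-level label distribution.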
Keywords/Search Tags:Machine Translation, Quality Estimation, Pre-trained Language Models, Replaced Token Detection, Pseudo Data