
Research On Unsupervised Machine Translation Bilingual Corpus Mining For Low Resource Languages

Posted on: 2024-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Liu
GTID: 2568307076973559
Subject: Artificial intelligence
Abstract:
Machine translation is now widely used in industry, education, culture, and other fields, so improving translation quality has become an important research direction in natural language processing. Parallel corpora are the foundation of machine translation, and their quantity and quality play a crucial role in translation performance. However, building parallel corpora manually is time-consuming, costly, and low-yield. How to mine parallel corpora automatically, quickly, and accurately has therefore attracted increasing attention from researchers, especially for low-resource languages. Many algorithms for automatic parallel corpus mining exist, but several problems remain: (1) the mining process still relies on supervised methods, which require a large amount of annotated data, while low-resource languages have little bilingual annotated data available; (2) most existing parallel corpora cover resource-rich languages and mainstream domains, and there is little research on offensive language in low-resource languages; (3) in bilingual mining, classifiers are mostly trained with random negative samples, ignoring the effect of negative sample choice on classifier performance. To address these problems, this thesis proposes an unsupervised bilingual corpus mining method for low-resource language scenarios and further studies offensive content mining in low-resource languages.

For problem (1), this thesis proposes an unsupervised parallel corpus mining method for low-resource languages. The method improves the efficiency of obtaining candidate bilingual text from web pages by adding a time window. To make the mining unsupervised, a small, high-precision bilingual seed dictionary is constructed by inducing bilingual signals and establishing cross-lingual mappings over monolingual corpora to obtain bilingual semantics. Finally, the mined low-resource corpus is evaluated with a Uyghur-Chinese machine translation system, and the results show a significant improvement in translation quality over the baseline.

For problem (2), this thesis proposes a regularized training method that combines adversarial training and transfer learning. Sample regeneration is used to augment the low-resource training data, so that an offensive language recognition model trained on resource-rich languages retains its performance when transferred to low-resource languages. Experiments on four low-resource languages show that the proposed method requires no labeled data and that its detection performance is comparable to, or even better than, that of supervised methods.

For problem (3), this thesis proposes a bilingual corpus mining method with enhanced negative samples. The method trains a parallel sentence recognition classifier by adding further types of negative samples and by using samples with different noise rates. The trained classifier is used to filter out bilingual sentence pairs that share a topic but are not parallel, while retaining sentence pairs that are true translations of each other. Experiments show that a parallel sentence classifier trained this way outperforms the baseline.
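The negative-sample idea for problem (3) can be illustrated with a minimal sketch. The abstract does not specify which negative sample types or noise rates the thesis uses, so everything below is an assumption made for illustration: the function names (`random_negative`, `noisy_negative`, `build_training_set`), the choice of word replacement as the noise operation, the example noise rates, and whitespace tokenization of pre-segmented text. It also assumes at least two sentence pairs and a target vocabulary of more than one word.

```python
import random

def random_negative(pairs, i, rng):
    """Easy negative: pair source i with the target of a different pair."""
    j = rng.randrange(len(pairs) - 1)
    if j >= i:
        j += 1  # skip index i so the sampled target is never the true one
    return pairs[i][0], pairs[j][1]

def noisy_negative(pair, vocab, noise_rate, rng):
    """Hard negative: replace roughly noise_rate of the target tokens.

    The result stays on the same topic as the source but is no longer
    an exact translation (hypothetical noise operation, not the thesis's).
    """
    src, tgt = pair
    tokens = tgt.split()
    n_replace = max(1, round(noise_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), min(n_replace, len(tokens)))
    for p in positions:
        # Always pick a word different from the one being replaced.
        choices = [w for w in vocab if w != tokens[p]]
        tokens[p] = rng.choice(choices)
    return src, " ".join(tokens)

def build_training_set(pairs, noise_rates=(0.2, 0.5), seed=0):
    """Return (source, target, label) triples; label 1 = parallel, 0 = not."""
    rng = random.Random(seed)
    vocab = sorted({w for _, tgt in pairs for w in tgt.split()})
    examples = [(s, t, 1) for s, t in pairs]
    for i, pair in enumerate(pairs):
        s, t = random_negative(pairs, i, rng)
        examples.append((s, t, 0))
        for rate in noise_rates:
            s, t = noisy_negative(pair, vocab, rate, rng)
            examples.append((s, t, 0))
    return examples
```

Mixing easy negatives (unrelated targets) with hard negatives (lightly corrupted true targets) is one plausible way to push a classifier toward separating same-topic but non-parallel pairs from true translations; the resulting triples could be fed to any sentence-pair classifier.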
Keywords: Data augmentation, Parallel corpus mining, Classifier, Transfer learning