Research On Unsupervised Machine Translation Bilingual Corpus Mining For Low Resource Languages

Posted on:2024-03-31

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Liu

Full Text:PDF

GTID:2568307076973559

Subject:artificial intelligence

Abstract/Summary:

Machine translation is now widely used in industry, education, culture, and other fields, so improving its quality has become an important research direction in natural language processing. Parallel corpora are the foundation of machine translation, and their quantity and quality play a crucial role in translation performance. However, acquiring parallel corpora manually is time-consuming, costly, and low-yield. How to mine parallel corpora automatically, quickly, and accurately has therefore attracted increasing attention from researchers, especially for low-resource languages. Many algorithms for automatic parallel corpus mining exist, but several problems remain: (1) the mining process still relies on supervised methods, which require large amounts of annotated data, whereas low-resource languages lack sufficient bilingual annotated data; (2) most parallel corpora cover resource-rich languages and mainstream domains, and there is little research on offensive language in low-resource languages; (3) in bilingual mining, classifiers are mostly trained with random negative samples, ignoring the impact of negative samples on classifier performance. To address these issues, this thesis proposes an unsupervised bilingual corpus mining method for low-resource language scenarios and further studies offensive content mining in low-resource languages.

To address problem (1), this thesis proposes an unsupervised parallel corpus mining method for low-resource languages. The method adds a time window to improve the efficiency of obtaining candidate bilingual text from web pages. To make the mining unsupervised, a small, high-precision bilingual seed dictionary is constructed by inducing bilingual signals and establishing cross-lingual mappings in monolingual corpora, from which bilingual semantics are obtained. Finally, the mined low-resource corpus is evaluated with a Uyghur-Chinese machine translation system, and the results show that translation quality improves significantly over the baseline.

To address problem (2), this thesis proposes a regularized training method that combines adversarial training and transfer learning. Sample regeneration is used to augment the low-resource training data so that an offensive language recognition model retains its performance when transferred from resource-rich languages to low-resource languages. Experiments on four low-resource languages show that the proposed method requires no labeled data and that its detection performance is comparable to, or even better than, that of supervised methods.

To address problem (3), this thesis proposes a bilingual corpus mining method with enhanced negative samples. The method trains a parallel sentence recognition classifier by adding more types of negative samples and using samples with different noise rates. The trained classifier filters out bilingual sentence pairs that share a topic but are not parallel, while retaining truly parallel pairs that are translations of each other. Experiments show that a parallel sentence classifier trained this way outperforms the baseline.
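The negative-sample idea behind problem (3) can be illustrated with a minimal sketch. The abstract does not say which negative types or noise rates the thesis uses, so everything here is an assumption made for illustration: the function names `corrupt_tokens` and `build_training_pairs`, the two negative types (a randomly mismatched target, and a target with a fraction of its tokens replaced), the example rates 0.3 and 0.6, and whitespace tokenization.

```python
import random


def corrupt_tokens(tokens, noise_rate, vocab, rng):
    """Replace each token with a random vocabulary word with probability noise_rate."""
    return [rng.choice(vocab) if rng.random() < noise_rate else tok for tok in tokens]


def build_training_pairs(parallel_pairs, noise_rates=(0.3, 0.6), seed=0):
    """Return (source, target, label, kind) examples for a parallel-sentence classifier.

    Hypothetical scheme, not the thesis's exact recipe:
      - label 1: the original parallel pair
      - label 0, kind "random": source paired with a different target sentence
      - label 0, kind "noise@r": target with roughly a fraction r of tokens replaced,
        giving "near miss" negatives that still overlap with the true translation
    """
    rng = random.Random(seed)
    targets = [tgt for _, tgt in parallel_pairs]
    # Assumes whitespace-tokenized text; unsegmented text would need a tokenizer first.
    vocab = sorted({tok for tgt in targets for tok in tgt.split()})

    examples = []
    for src, tgt in parallel_pairs:
        examples.append((src, tgt, 1, "positive"))

        # Negative type 1: a target sentence taken from another pair.
        others = [t for t in targets if t != tgt]
        if others:
            examples.append((src, rng.choice(others), 0, "random"))

        # Negative type 2: the true target corrupted at several noise rates.
        tokens = tgt.split()
        for rate in noise_rates:
            noisy = corrupt_tokens(tokens, rate, vocab, rng)
            if noisy != tokens:  # skip draws that changed nothing
                examples.append((src, " ".join(noisy), 0, f"noise@{rate}"))
    return examples


if __name__ == "__main__":
    toy_pairs = [
        ("src sentence one", "tgt sentence one"),
        ("src sentence two", "tgt phrase two here"),
        ("src sentence three", "another tgt line"),
    ]
    for example in build_training_pairs(toy_pairs):
        print(example)
```

Low noise rates give hard negatives that resemble the true translation, which is roughly the "similar topic but not parallel" case the classifier is meant to reject. Higher rates move toward easy, random-like negatives.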

Keywords/Search Tags:

Data augmentation, Parallel corpus mining, Classifier, Transfer learning

Related items

1	Research On Large-Scale Bilingual Parallel Corpus Extraction From The Web
2	A Study On The Key Technologies Of Web-Based Indonesian-Chinese Parallel Corpus Construction
3	Web-oriented Multilingual Parallel Sentence Pairs Mining Techniques
4	Research On The Automatic Construction Of Chinese-Japanese Parallel Corpus
5	The Research On Classifier Ensemble Learning For Data Mining
6	Research On Cross-corpus Speech Emotion Recognition Technology Based On Transfer Learning
7	Classifier Design Based On Feature Transfer And Model Transfer
8	Research On Active Learning Based Automatic Corpus Annotation
9	Data Stream Classification Research Based On Transfer Learning
10	Research On Classifier Combination And Its Relevant Techniques Of Distributed Data Mining