Cross-lingual word embeddings represent words from two or more languages in a common space, providing basic support for semantic computation and cross-lingual knowledge transfer. Most cross-lingual word representation methods require some form of supervision during training, and the scarcity of such resources has become a bottleneck for many cross-lingual tasks. Unsupervised methods require no parallel corpus or bilingual dictionary: relying only on monolingual corpora, they can automatically learn cross-lingual word representations and translation dictionaries, which gives them considerable research significance. However, existing methods share a serious limitation: they require the word vectors trained separately on each monolingual corpus to satisfy the isomorphism assumption.

This paper proposes a cross-lingual word vector co-training method based on a feedback mechanism, which aims to make word vectors cross-lingual by construction. We first improve the baseline model from several angles, studying word vector initialization, initial dictionary quality, mapping methods, and distance measures, and find that techniques such as Iterative Normalization and Cross-domain Similarity Local Scaling (CSLS) improve the accuracy of dictionary extraction. We then give criteria for building the training dictionary, studying both the word-selection criterion and the size of the candidate word list; a confidence-based dictionary extraction method proves most effective. Using the training dictionary, word vectors are co-trained with a replacement-based method and a loss-function-based method, respectively. The two methods follow different ideas, but both aim to bring the vectors of dictionary word pairs close together
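The two preprocessing and retrieval techniques named above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names, the number of normalization iterations, and the neighborhood size `k` are assumptions chosen for clarity.

```python
import numpy as np

def iterative_normalization(X, n_iter=5):
    """Iterative Normalization: alternately unit-length normalize each
    word vector and mean-center the embedding matrix, so that both
    conditions approximately hold after a few rounds."""
    X = X.astype(np.float64).copy()
    for _ in range(n_iter):
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # length-normalize rows
        X -= X.mean(axis=0, keepdims=True)             # mean-center columns
    return X

def csls_scores(X, Y, k=10):
    """Cross-domain Similarity Local Scaling between source rows X and
    target rows Y (both assumed length-normalized):
        CSLS(x, y) = 2 * cos(x, y) - r_T(x) - r_S(y),
    where r_T(x) / r_S(y) are the mean cosine similarities of x / y to
    their k nearest neighbors in the other space. Penalizing dense
    neighborhoods mitigates the hubness problem in dictionary extraction."""
    sims = X @ Y.T  # cosine similarities of all source-target pairs
    # mean similarity of each source word to its k nearest target words
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # mean similarity of each target word to its k nearest source words
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt
```

A dictionary candidate for each source word is then the target word maximizing its CSLS row, rather than raw cosine similarity.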
in the shared space after training. At the same time, the results of the two methods show that to obtain word vectors with cross-lingual properties, it is the information underlying the word vectors, such as the parameter matrices, that should be brought close together in space, rather than only the word vectors finally produced by training. Experiments show that the dictionary-extraction accuracy of both methods exceeds that of the baseline model, demonstrating that the proposed method is effective.
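As a minimal illustration of the loss-function-based idea, the alignment term below measures how far apart the embeddings of dictionary word pairs are; during co-training it would be added to each language's monolingual training loss so that gradient updates pull paired vectors together. The function name, the squared-distance form, and the uniform weighting are illustrative assumptions; the thesis's actual loss may differ.

```python
import numpy as np

def alignment_loss(X_src, Y_tgt, pairs):
    """Mean squared Euclidean distance between the embeddings of the
    word pairs (i, j) in the training dictionary. Minimizing this term
    alongside the monolingual objectives drives paired source and
    target vectors toward each other in the shared space."""
    src_idx, tgt_idx = zip(*pairs)
    diff = X_src[list(src_idx)] - Y_tgt[list(tgt_idx)]
    return float((diff ** 2).sum(axis=1).mean())
```

The replacement-based method reaches a similar goal differently: occurrences of a word in one corpus are swapped for its dictionary translation, so that paired words are trained in shared contexts instead of through an explicit loss term.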