Cross-lingual word embeddings represent words from two or more languages in a common space, providing basic support for semantic computation and cross-lingual knowledge transfer. Most cross-lingual word representation methods require some form of supervision during training, and the scarcity of such resources has become a bottleneck for many cross-lingual tasks. Unsupervised methods require no parallel corpus or bilingual dictionary: relying only on monolingual corpora, they can automatically learn cross-lingual word representations and translation dictionaries, which gives them considerable research significance. However, existing methods share a serious limitation: they require the word vectors trained separately on each monolingual corpus to satisfy the isomorphism assumption.

This paper proposes a cross-lingual word vector co-training method based on a feedback mechanism, which aims to make word vectors cross-lingual by construction. We first improve the baseline model from several angles, studying word vector initialization, initial dictionary quality, mapping methods, and distance measures, and find that techniques such as Iterative Normalization and Cross-domain Similarity Local Scaling (CSLS) improve the accuracy of dictionary extraction. We then give criteria for building the training dictionary, studying both the word-selection criterion and the size of the candidate word list; a confidence-based dictionary extraction method proves most effective. Using the training dictionary, word vectors are co-trained with a replacement-based method and a loss-function-based method, respectively. The two methods follow different ideas, but both aim to bring the vectors of dictionary word pairs close together
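The two preprocessing and retrieval techniques named above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names, the number of normalization iterations, and the neighborhood size `k` are assumptions chosen for clarity.

```python
import numpy as np

def iterative_normalization(X, n_iter=5):
    """Iterative Normalization: alternately unit-length normalize each
    word vector and mean-center the embedding matrix, so that both
    conditions approximately hold after a few rounds."""
    X = X.astype(np.float64).copy()
    for _ in range(n_iter):
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # length-normalize rows
        X -= X.mean(axis=0, keepdims=True)             # mean-center columns
    return X

def csls_scores(X, Y, k=10):
    """Cross-domain Similarity Local Scaling between source rows X and
    target rows Y (both assumed length-normalized):
        CSLS(x, y) = 2 * cos(x, y) - r_T(x) - r_S(y),
    where r_T(x) / r_S(y) are the mean cosine similarities of x / y to
    their k nearest neighbors in the other space. Penalizing dense
    neighborhoods mitigates the hubness problem in dictionary extraction."""
    sims = X @ Y.T  # cosine similarities of all source-target pairs
    # mean similarity of each source word to its k nearest target words
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # mean similarity of each target word to its k nearest source words
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt
```

A dictionary candidate for each source word is then the target word maximizing its CSLS row, rather than raw cosine similarity.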
in the shared space after training. At the same time, the results of the two methods show that to obtain word vectors with cross-lingual properties, it is the information underlying the word vectors, such as the parameter matrices, that should be brought close together in space, rather than only the word vectors finally produced by training. Experiments show that the dictionary-extraction accuracy of both methods exceeds that of the baseline model, demonstrating that the proposed method is effective.
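As a minimal illustration of the loss-function-based idea, the alignment term below measures how far apart the embeddings of dictionary word pairs are; during co-training it would be added to each language's monolingual training loss so that gradient updates pull paired vectors together. The function name, the squared-distance form, and the uniform weighting are illustrative assumptions; the thesis's actual loss may differ.

```python
import numpy as np

def alignment_loss(X_src, Y_tgt, pairs):
    """Mean squared Euclidean distance between the embeddings of the
    word pairs (i, j) in the training dictionary. Minimizing this term
    alongside the monolingual objectives drives paired source and
    target vectors toward each other in the shared space."""
    src_idx, tgt_idx = zip(*pairs)
    diff = X_src[list(src_idx)] - Y_tgt[list(tgt_idx)]
    return float((diff ** 2).sum(axis=1).mean())
```

The replacement-based method reaches a similar goal differently: occurrences of a word in one corpus are swapped for its dictionary translation, so that paired words are trained in shared contexts instead of through an explicit loss term.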