Research On Low-resource Cross-language Word Embedding And Sentence Embedding Methods For Chinese And Vietnamese

Posted on: 2023-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Wu
Full Text: PDF
GTID: 2555306797982649
Subject: Software engineering
Abstract/Summary:
Cross-lingual embedding is the main method of feature extraction for multilingual information. It aims to map embedding vectors that carry the same semantics in different languages into a shared space and align them there, so that semantic information can be modeled and transferred across languages. It underlies many cross-lingual tasks, such as cross-lingual text classification and cross-lingual sentiment analysis. Research on cross-lingual embedding has achieved good results on resource-rich language pairs such as English-German. For low-resource language pairs such as Chinese-Vietnamese, however, large-scale parallel corpora are lacking, so the trained cross-lingual embeddings cannot achieve accurate semantic alignment; this is one of the urgent open problems in the field. This thesis studies cross-lingual word embedding and sentence embedding for the low-resource Chinese-Vietnamese language pair and mainly completes the following work:

(1) A Chinese-Vietnamese cross-lingual word embedding method based on word cluster constraints is proposed. In traditional cross-lingual word embedding methods, the scale and quality of Chinese-Vietnamese bilingual dictionaries are limited, so a mapping matrix learned only from word alignment information generalizes poorly to unlabeled words outside the dictionary. The dictionary does, however, contain synonyms and semantically similar words, which can be grouped into aligned word clusters so that the mapping matrix also learns common features and mapping relationships shared by similar words across languages. The proposed method uses different types of association relations to mine word cluster alignment information from the bilingual dictionary and integrates it into the training of the mapping matrix. The mapping matrix thus learns mapping relationships at two granularities, words and word clusters, which alleviates the poor generalization caused by the small bilingual dictionary. Experimental results on the Chinese-Vietnamese dictionary induction task show that fusing word cluster alignment constraints significantly improves alignment accuracy over the traditional cross-lingual word embedding method. The method improves the alignment accuracy of the Chinese-Vietnamese bilingual space in a low-resource environment and is better suited to complex cross-lingual scenarios with large language differences and small bilingual dictionaries, such as Chinese and Vietnamese.

(2) A context-based Chinese-Vietnamese cross-lingual sentence embedding method is proposed. Because sentence-level Chinese-Vietnamese parallel corpora are scarce, multilingual pre-trained models lack clear cross-lingual supervision signals during training. The resulting contextual cross-lingual sentence embeddings cannot achieve good semantic alignment for low-resource language pairs with large grammatical differences, such as Chinese and Vietnamese. To address this, small-scale Chinese-Vietnamese parallel sentences are used to train a fine-tuning layer with a siamese network structure that reconstructs the contextual cross-lingual sentence embeddings, improving semantic alignment in the low-resource setting by maximizing the similarity of semantically similar embeddings in the shared space. First, small-scale Chinese-Vietnamese parallel sentence pairs (positive examples) and randomly constructed non-parallel sentence pairs (negative examples) are fed into the multilingual pre-trained model mBERT to obtain the corresponding contextual cross-lingual sentence embeddings. Then a linear fine-tuning layer with a siamese structure reconstructs the embeddings obtained from mBERT so that the embedding similarity of positive pairs is as high as possible and that of negative pairs is as low as possible; a contrastive loss built on this objective guides the optimization of the fine-tuning layer through backpropagation. Experiments show that the sentence embeddings reconstructed by the siamese fine-tuning layer lie closer together in the shared space and that the embedding distributions of the two languages overlap more. This improves the accuracy of Chinese-Vietnamese cross-lingual sentence embeddings in semantic similarity calculation and improves the semantic alignment of the multilingual pre-trained model in the Chinese-Vietnamese low-resource environment.

(3) A Chinese-Vietnamese cross-lingual embedding prototype system is built. Based on the above research results, a Chinese-Vietnamese cross-lingual embedding prototype system is designed and implemented. First, the models proposed in the two research points are iteratively trained with the optimal parameter settings obtained from the ablation experiments. Then interface files are written for the two models to connect them to the system, providing retrieval of Chinese-Vietnamese cross-lingual word embeddings and sentence embeddings. Finally, each functional module and the front-end pages are developed on this basis. The prototype system integrates a cross-lingual embedding generation module, a cross-lingual embedding visualization module, a cross-lingual dictionary induction module, a cross-lingual sentence semantic similarity calculation module, a cross-lingual embedding generation interface module, and others, providing users with a visual platform and interface for obtaining Chinese-Vietnamese cross-lingual embedding information.
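
The two modeling ideas in (1) and (2) can be illustrated with a minimal NumPy sketch. It is not the thesis implementation, and several choices are assumptions made only for illustration: each aligned word cluster is represented by the centroid of its member vectors, the mapping matrix is constrained to be orthogonal and solved in closed form (orthogonal Procrustes via SVD), the contrastive loss uses cosine distance with a margin, and all names (`cluster_centroid_pairs`, `fit_mapping`, `contrastive_loss`, `cluster_weight`, `margin`) are hypothetical. The data is synthetic, and no mBERT model is loaded; random vectors stand in for embeddings.

```python
import numpy as np


def cluster_centroid_pairs(X, Y, clusters):
    """Average the source and target vectors of each aligned word cluster.

    `clusters` is a list of index lists into the dictionary rows of X and Y
    (an assumed representation of "aligned word clusters").
    """
    src = np.stack([X[idx].mean(axis=0) for idx in clusters])
    tgt = np.stack([Y[idx].mean(axis=0) for idx in clusters])
    return src, tgt


def fit_mapping(X, Y, clusters, cluster_weight=1.0):
    """Orthogonal W minimizing ||XW - Y||^2 + cluster_weight * ||Cx W - Cy||^2.

    Word pairs and cluster-centroid pairs are stacked into one Procrustes
    problem, so W is fitted at both granularities at once.
    """
    Cx, Cy = cluster_centroid_pairs(X, Y, clusters)
    s = np.sqrt(cluster_weight)
    A = np.vstack([X, s * Cx])
    B = np.vstack([Y, s * Cy])
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


def contrastive_loss(zh, vi, labels, layer, margin=0.5):
    """Contrastive loss for a shared (siamese) linear layer.

    Both inputs pass through the same `layer`; positive pairs (label 1) are
    pulled together, negative pairs (label 0) pushed beyond `margin`.
    """
    a = zh @ layer
    b = vi @ layer
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    dist = 1.0 - np.sum(a * b, axis=1)  # cosine distance per pair
    pos = labels * dist ** 2
    neg = (1.0 - labels) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pos + neg))


# Synthetic demo: the target space is a hidden rotation of the source space.
rng = np.random.default_rng(0)
d, n = 8, 50
R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden orthogonal map
X = rng.normal(size=(n, d))                   # toy source-side vectors
Y = X @ R                                     # toy target-side vectors
clusters = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]  # hypothetical synonym groups
W = fit_mapping(X, Y, clusters, cluster_weight=0.5)

# Five parallel pairs (label 1) followed by five mismatched pairs (label 0).
labels = np.array([1.0] * 5 + [0.0] * 5)
zh = X[:10] @ W
vi = np.vstack([Y[:5], Y[10:15]])
loss = contrastive_loss(zh, vi, labels, np.eye(d), margin=0.5)
print("mapping recovered:", np.allclose(W, R, atol=1e-6), "loss:", loss)
```

In a real setting the rows of `X` and `Y` would be monolingual embeddings of dictionary entries, `layer` would be a trainable parameter updated by gradient descent on the loss, and the closed-form orthogonal solution is only one of several ways a mapping matrix can be fitted.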
Keywords/Search Tags:Chinese-Vietnamese bilingual, low-resource language, cross-lingual embedding, word cluster constraint, siamese network