Design And Implementation Of Korean-Chinese Cross-Language Text Classification Based On Multi-Layer Semantic Feature Alignment

Posted on: 2023-03-11 | Degree: Master | Type: Thesis
Country: China | Candidate: D H Cui | Full Text: PDF
GTID: 2545306617493774 | Subject: Electronic Information (in the field of computer technology) (professional degree)
Abstract/Summary:
In today’s globalized world, communication between countries and nationalities is becoming more frequent, and language differences are an important factor hindering its further development. Cross-language text classification overcomes these differences so that text data in different languages can be organized and managed together, which lets users locate and use multilingual text data more efficiently. In this context, this dissertation studies Korean-Chinese cross-language text classification. By combining cross-language word vectors with adversarial training, we improved the alignment of words and sentences in the feature space for Korean-Chinese cross-language text classification. On this basis, we designed and implemented a prototype system for classification-based Korean-Chinese cross-language text retrieval.

First, Korean and Chinese word features were aligned, and Korean-Chinese cross-language word vectors were constructed using a self-learning mapping method. A mapping matrix was first trained on a small-scale random seed lexicon. The mapping matrix was then used to obtain a new seed dictionary, which was used in turn to continue training the mapping matrix. Finally, the mapping matrix was used to obtain word representations of the two languages in the same feature space, so that word vectors of different languages with the same meaning are highly similar.

Second, the cross-language feature discrepancy problem was addressed with an adversarial training mechanism. A corpus of 30,000 Chinese and Korean science and technology literature abstracts and 40,000 Chinese and Korean news texts was collected. Text features were extracted by a feature extractor that combines a convolutional neural network with a self-attention mechanism: the convolutional neural network extracts local semantic information, and the self-attention mechanism extracts long-range semantic information. A discriminator was then built to judge the language of the input text features. Adversarial training made it difficult for the discriminator to determine the source language of a text feature, which achieved cross-language feature alignment. The aligned features were then applied to the task of Korean-Chinese cross-language text classification.

Finally, a prototype Korean-Chinese cross-language text retrieval system was designed and implemented based on the proposed cross-language classification model. The system has three main functional modules: a storage module, a classification module, and a retrieval module. The user interface was built with PyQt5. The storage module uses a k-d tree data structure to organize the data for efficient retrieval. The classification module is based mainly on the cross-language text classification model proposed in this dissertation. The retrieval module builds text feature representations with the feature extraction part of the classification model and retrieves related Korean and Chinese texts based on cosine similarity.

The proposed method achieved good results on the Korean-Chinese cross-language text classification task without relying on an aligned corpus or labeled target-language data. Experiments show that the proposed multi-layer semantic-alignment model outperforms other cross-language models on cross-language text classification. It also improves the accuracy of monolingual text classification and makes more efficient use of small-scale labeled data in the target language. Test results show that the prototype retrieval system designed and implemented in this dissertation has good cross-language text retrieval performance.
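
The self-learning mapping step described in the abstract (train a mapping matrix on a seed lexicon, use it to induce a new dictionary, and repeat) can be illustrated with a minimal sketch. The abstract does not say how the mapping matrix is fitted or how the new dictionary is induced, so the sketch assumes an orthogonal mapping solved by the Procrustes method and nearest-neighbor induction by cosine similarity. All function names are hypothetical and are not taken from the thesis.

```python
import numpy as np


def normalize_rows(X):
    """Scale every word vector (row) to unit length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def procrustes(X, Y):
    """Return the orthogonal W that minimizes ||X W - Y||_F.

    Assumption: the mapping is constrained to be orthogonal; the
    abstract only says that a mapping matrix is trained.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def induce_dictionary(XW, Y):
    """Pair each mapped source word with its most similar target word.

    Rows are unit length, so the dot product is the cosine similarity.
    """
    sims = XW @ Y.T
    return np.arange(XW.shape[0]), sims.argmax(axis=1)


def self_learning_map(X, Y, seed_src, seed_tgt, n_iter=10):
    """Alternate between fitting W and inducing a new dictionary.

    X, Y: source and target embedding matrices, one word per row.
    seed_src, seed_tgt: row indices of the small seed lexicon pairs.
    Returns the final W and the last induced dictionary.
    """
    X, Y = normalize_rows(X), normalize_rows(Y)
    src, tgt = np.asarray(seed_src), np.asarray(seed_tgt)
    for _ in range(n_iter):
        W = procrustes(X[src], Y[tgt])            # train the mapping matrix
        src, tgt = induce_dictionary(X @ W, Y)    # obtain a new dictionary
    return W, (src, tgt)
```

In this sketch the induced dictionary replaces the seed lexicon at every iteration. A fixed iteration count stands in for a convergence test, which is also an assumption.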
Keywords/Search Tags: cross-language text classification, word embedding mapping, adversarial training, cross-language text retrieval