
Research And Verification On The Construction Method Of Cross-language Retrieval Data Sets

Posted on: 2021-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: M Chen
Full Text: PDF
GTID: 2428330605964010
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, users increasingly want to retrieve information not only in a single language but also in other languages. Cross-language information retrieval has therefore attracted the attention of many researchers and has become one of the research hotspots of information retrieval. A cross-language retrieval system takes a query entered in one language and retrieves relevant information from documents written in another language or in several languages. This helps search engine users who do not know a foreign language to retrieve richer results and obtain multilingual material. Research on cross-language information retrieval is significant for two reasons. On the one hand, the technology can, to a certain extent, meet users' demand for multilingual information. On the other hand, cross-language information retrieval is an important part of information retrieval, so studying it enriches and improves the theoretical system of information retrieval.

At present, deep learning has achieved good results in monolingual retrieval, but it has not been widely used in cross-language information retrieval. One reason is the lack of suitable data for training neural retrieval models in the cross-language setting. To address this, we propose a simple and flexible data set construction scheme. Our English-Chinese bilingual data set is constructed from Wikipedia data and supports the training and evaluation of cross-language retrieval models with English queries and Chinese documents. The data set consists of three parts: English queries, Chinese documents, and relevance judgments linking them. According to the degree of relevance between the Chinese articles and the articles behind the English queries, each document is assigned one of three relevance levels: most relevant, partially relevant, or irrelevant.

To verify the usability of the data set, we propose a neural retrieval model for cross-language information retrieval based on BiLSTM and an attention mechanism. Unlike traditional cross-language retrieval methods, this model needs no explicit translation step: it encodes source-language and target-language text into the same cross-language semantic space and then computes relevance from the encoded text vectors. Experimental results show that the Wikipedia-based data set supports successful training and testing of the model, and that the model outperforms the baseline model on the test set.
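As a rough illustration of the ideas summarized above, the following minimal sketch shows one way the three-part data set and the vector-based relevance computation could be represented. It is not the thesis implementation: the numeric grades (2, 1, 0), the `Judgment` record, the identifiers such as `en-1` and `zh-1`, and the use of cosine similarity are all assumptions made for illustration, and the toy vectors merely stand in for the output of a trained encoder.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Sequence

# Hypothetical numeric grades for the three relevance levels in the abstract.
MOST_RELEVANT = 2
PARTIALLY_RELEVANT = 1
IRRELEVANT = 0


@dataclass(frozen=True)
class Judgment:
    """One relevance judgment linking an English query to a Chinese document."""
    query_id: str
    doc_id: str
    grade: int


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector carries no direction, so score it as 0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec: Sequence[float],
                   doc_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document ids sorted from highest to lowest cosine score."""
    return sorted(doc_vecs,
                  key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                  reverse=True)


# Toy data: the vectors stand in for encoder output in a shared semantic space.
query_vec = [1.0, 0.0]
doc_vecs = {"zh-1": [0.9, 0.1], "zh-2": [0.5, 0.5], "zh-3": [0.0, 1.0]}
judgments = [
    Judgment("en-1", "zh-1", MOST_RELEVANT),
    Judgment("en-1", "zh-2", PARTIALLY_RELEVANT),
    Judgment("en-1", "zh-3", IRRELEVANT),
]

ranking = rank_documents(query_vec, doc_vecs)
```

In this toy setup the ranking produced from the vectors can be compared against the graded judgments, which is the kind of training and evaluation signal the abstract describes the data set as providing.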
Keywords/Search Tags:Cross-language retrieval, Data set construction, Neural retrieval model, Deep learning