Research On The Prototype System Of Sino-Tibetan Cross-language Tourism Field Relationship Extraction And Knowledge Base Construction

Posted on: 2020-07-01  Degree: Master  Type: Thesis
Country: China  Candidate: X L Feng  Full Text: PDF
GTID: 2435330575996409  Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of the Internet has produced a growing number of Chinese-language travel websites, which provide tourists with a wealth of information. However, the information on these websites is so complex that it is difficult to form a quick, accurate, and comprehensive picture of a scenic spot from the vast amount of unstructured text. In contrast, data in the Tibetan tourism domain is very scarce. How to use the knowledge available in a resource-rich language to assist the construction of a Tibetan tourism knowledge base, and how to extract tourism domain knowledge from massive, multi-source, unstructured data in resource-rich languages, are questions of important research value. To address these problems, this thesis studies relation extraction and knowledge base construction in the Chinese-Tibetan cross-language tourism domain. The main work is as follows:

(1) Tibetan scenic-spot corpora are scarce and hard to obtain directly. This thesis therefore first extracts attribute relations from the resource-rich Chinese tourism corpus to obtain comprehensive knowledge of scenic spots, and then transfers that knowledge to Tibetan. Second, after analyzing the characteristics of Chinese tourist-area text, it uses a BiLSTM neural network model to extract relations from Tibetan tourism text. To enrich the semantic representation ability of the word vectors, the model combines part-of-speech features and position features with the word vectors. Comparison experiments show that the word vector representation fused with this richer feature information performs better than the traditional word vector representation.

(2) Chinese-Tibetan machine translation systems have yet to reach a practically usable level, and no tourism dictionary is available, so translating knowledge about Chinese-language attractions into Tibetan is a difficult problem. To this end, this thesis studies the construction of a Chinese-Tibetan tourism domain dictionary from multiple data sources (such as Wikipedia and Baidu Encyclopedia) and a dictionary expansion method based on Chinese-Tibetan cross-language word vectors. First, a high-quality Chinese-Tibetan tourism domain dictionary is constructed from the multiple data sources, and the acquired Chinese-language tourism knowledge is translated into Tibetan. While ensuring translation accuracy, the average translation coverage in the experiments reached 70.44%, achieving the goal of migrating knowledge from a resource-rich language to a low-resource language. Second, for dictionary expansion based on cross-language word vectors, both supervised and unsupervised methods are used to learn the linear mapping between Chinese and Tibetan word vectors. In the experiment, when 500 Chinese characters are translated into Tibetan by cross-language word vector mapping, the highest accuracy reaches 40.64%, which indicates that the method is effective.

(3) The scenic-spot texts on current travel websites are long and varied in type, and short descriptive abstracts are lacking. This thesis first uses a pattern matching method to extract paragraph-level summary information. Taking into account paragraph characteristics such as "location information" and "affiliation information", natural paragraphs that satisfy certain features are extracted from the travel text as the descriptive attribute text of the attraction. In the experiment, the F1 value reached up to 88.06%. Second, the thesis proposes a descriptive text generation method based on the attribute relations of attractions, which uses the extracted attribute knowledge together with text generation to obtain descriptive summary information for an attraction. Finally, both methods are used to generate descriptive summary information for 30 scenic spots, and the validity of the descriptive text obtained by the two methods is compared.

(4) A prototype information retrieval system for Chinese-Tibetan cross-language tourism is constructed. The acquired Chinese and Tibetan tourism domain knowledge is stored in a database, and the Chinese-Tibetan tourism knowledge base is constructed automatically. The tourism knowledge base contains 844 Tibetan scenic spots, with a total of 5367 attractions. On this basis, a Chinese-Tibetan bilingual knowledge retrieval system based on a C/S (client/server) structure is designed and implemented. The system provides a visual display of travel knowledge and supports search and display in both Chinese and Tibetan.

The main innovations of this thesis are as follows: (1) Cross-language word vectors are used to transfer the rich knowledge of Chinese tourism to low-resource Tibetan in order to construct a Tibetan tourism knowledge base, which provides research ideas for building knowledge bases in other low-resource languages. (2) Pattern-matching paragraph extraction and text generation are used to obtain descriptive paragraph texts of scenic spots, which provides a reference for research on extracting entity triple information beyond the sentence level.
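The paragraph-level pattern matching described in item (3) can be illustrated with a minimal sketch. The abstract does not list the actual patterns or features used, so the cue expressions below (`LOCATION_CUES`, `AFFILIATION_CUES`), the function names, and the sample text are all assumptions made for illustration. The sketch keeps a natural paragraph when it names the attraction and contains a location or affiliation cue, and scores the selection against a gold set with F1.

```python
import re

# Hypothetical cue patterns (assumptions, not taken from the thesis).
# 位于 / 坐落于 / 地处: "is located in"; 隶属于 / 属于: "belongs to".
LOCATION_CUES = re.compile(r"位于|坐落于|地处")
AFFILIATION_CUES = re.compile(r"隶属于|属于")


def is_descriptive(paragraph, name):
    """Keep a paragraph that names the attraction and has a location or affiliation cue."""
    if name not in paragraph:
        return False
    return bool(LOCATION_CUES.search(paragraph) or AFFILIATION_CUES.search(paragraph))


def extract_description(text, name):
    """Split text into natural paragraphs (one per line) and keep the matching ones."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return [p for p in paragraphs if is_descriptive(p, name)]


def f1_score(predicted, gold):
    """F1 over sets of selected paragraphs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative sample text, not from the thesis corpus.
SAMPLE_NAME = "布达拉宫"  # Potala Palace
SAMPLE_TEXT = (
    "布达拉宫位于西藏自治区拉萨市中心的红山上。\n"  # location cue
    "门票价格随季节调整,旺季需要提前预约。\n"      # ticket note, no cue and no name
    "布达拉宫属于世界文化遗产。"                    # affiliation cue
)

selected = extract_description(SAMPLE_TEXT, SAMPLE_NAME)
for paragraph in selected:
    print(paragraph)
```

On the sample text, the first and third paragraphs are kept and the ticket note is dropped. A real system would need a much richer pattern set and the feature checks the thesis refers to; this sketch only shows the shape of the filtering and scoring steps.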
Keywords/Search Tags: Relationship extraction, cross-language word vector, knowledge base, Chinese-Tibetan tourism information retrieval system