The Research Of Text Sample Extension Method Based On Wikipedia And Its Application

Posted on: 2019-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Liu
GTID: 2428330563991729
Subject: Software engineering
Abstract/Summary:
With the advent of the Internet era, a great deal of text information is waiting to be processed, and research on text processing is becoming increasingly important. At present, most text processing methods are based on machine learning, in which the training samples are the key factor influencing the learning result. However, the lack of reliable and efficient processing techniques makes it difficult to label text automatically, so high-quality samples are difficult and expensive to obtain. To solve this problem, this paper proposes a text sample extension method based on Wikipedia. The method makes full use of the semantic information and structural characteristics of Wikipedia: it calculates the correlation between the labeled sample data and Wikipedia entries, then expands the sample set according to the links between entries. The method improves the efficiency of sample expansion and the performance of text processing applications.

The main work and innovations of this paper are as follows:

1) According to the semantic features of Wikipedia entries, a Thematic Information Correlation (TIC) algorithm is proposed, based on thematic information and text semantic correlation.
2) Based on the structural characteristics of Wikipedia entries, the relationship between links and semantics in Wikipedia is quantified, and a Link Semantic Correlation (LSC) algorithm is proposed.
3) Building on these correlation calculation methods, three sample extension methods are proposed: Wikipedia Sample Extension with Themes (WSE-T), Wikipedia Sample Extension with Links (WSE-L), and Wikipedia Sample Extension with Themes and Links (WSE-TL).
4) The sample extension methods are applied to text classification and text clustering and compared with semi-supervised learning and supervised learning. The methods are evaluated and analyzed experimentally with respect to the number of extended samples, the test data sets, the number of classes, and the models.

The results show that the sample extension method in this paper can effectively improve the performance of text processing.
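
The general idea of link-based sample extension can be illustrated with a minimal sketch. The abstract does not define the TIC or LSC algorithms, so the code below is not the thesis's method: it assumes plain bag-of-words cosine similarity as a stand-in for the correlation measure, and the function names, the toy entries, and the `threshold` parameter are all hypothetical. For each labeled sample it picks the most similar entry as a seed, then follows that entry's links and keeps the linked entries whose similarity to the sample reaches the threshold.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector, stored as a Counter."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_samples(labeled, entries, links, threshold=0.2):
    """Return (entry title, label) pairs found by following entry links.

    labeled: list of (text, label) pairs
    entries: dict mapping entry title -> entry text
    links:   dict mapping entry title -> list of linked entry titles
    Cosine similarity is an assumed stand-in for the correlation measure.
    """
    best = {}  # title -> (label, score) of the strongest match so far
    for text, label in labeled:
        sample_vec = bow(text)
        # Seed: the entry most similar to the labeled sample.
        seed = max(entries, key=lambda t: cosine(sample_vec, bow(entries[t])))
        # Candidates: entries linked from the seed.
        for neighbor in links.get(seed, []):
            if neighbor not in entries:
                continue
            score = cosine(sample_vec, bow(entries[neighbor]))
            if score >= threshold and score > best.get(neighbor, (None, 0.0))[1]:
                best[neighbor] = (label, score)
    return [(title, label) for title, (label, _) in sorted(best.items())]


# Toy data (hypothetical, not real Wikipedia content).
entries = {
    "Football": "football is a team sport played with a ball",
    "Goalkeeper": "a goalkeeper guards the goal for a football team",
    "Stadium": "a stadium is a venue for concerts and events",
}
links = {"Football": ["Goalkeeper", "Stadium"]}
labeled = [("football team wins ball game", "sports")]

print(expand_samples(labeled, entries, links))
```

In this toy run the "Goalkeeper" entry is added with the label "sports", while "Stadium" is linked from the seed but shares no terms with the sample and is filtered out. A real system would replace the cosine stand-in with a measure that also uses thematic and link information, as the abstract describes.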
Keywords/Search Tags: Sample Extension, Wikipedia, Text Processing, Correlation