
Key Technology Research On Short Text Similarity

Posted on: 2017-12-27 | Degree: Master | Type: Thesis
Country: China | Candidate: L Q Liu | Full Text: PDF
GTID: 2348330488975452 | Subject: Computer software and theory
Abstract/Summary:
With the continued development of computer science and the Internet, more and more data appears on the Web in the form of short text, such as news headlines, forum posts, and microblog messages. Classifying and clustering these short texts to mine useful information and create value for different applications has made short text mining an increasingly urgent research task.

The thesis first introduces the concept of short text and the two main problems it raises: first, because short texts are sparse, common text-processing algorithms either cannot be applied or cannot achieve the same effect as on long texts; second, short texts lack context and therefore provide little background information. It then analyzes the main conventional methods for long text similarity, chiefly cosine similarity based on the vector space model (VSM) and semantic similarity based on semantic dictionaries, followed by the principal short text similarity methods, namely statistical methods built on large-scale external text feature sets and description-based methods. For both groups of methods, the thesis summarizes their characteristics and analyzes their shortcomings.

The thesis then introduces the principle of the LDA topic model and its important parameters. LDA can uncover the deeper intrinsic semantics of short texts, so that short text similarity computation is no longer limited to surface language structure but is instead modeled and computed from the latent topics inherent in the texts. LDA assumes that each document corresponds to multiple topics: it first generates a document-topic distribution and then, in each sampling iteration, assigns a topic to every word. For short texts, however, each document carries too little information, and LDA suffers from data sparsity when building the model. The thesis therefore introduces the multi-granularity topic model, which improves on a single-granularity model and can partially alleviate this sparsity problem: by building topic models with different numbers of topics, it exploits the useful information in different dimensions of the short text data set and thereby strengthens the semantic associations used in short text similarity calculation. Finally, two improved methods for short text similarity calculation are proposed.
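Before turning to the two proposed improvements, the representations discussed above can be illustrated with a minimal sketch: a VSM/TF-IDF cosine similarity baseline, plus LDA document-topic distributions fitted at more than one granularity. This is an illustrative approximation, not the thesis implementation; the scikit-learn calls, the toy corpus, and the topic counts are assumptions.

# Minimal sketch (not the thesis code): VSM cosine baseline plus
# LDA document-topic distributions at two assumed granularities.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "stock market falls on weak earnings report",   # toy short texts
    "shares slide after company misses earnings",
    "new smartphone camera review and battery test",
]

# Baseline: cosine similarity in the vector space model (TF-IDF weights).
tfidf = TfidfVectorizer().fit_transform(corpus)
vsm_sim = cosine_similarity(tfidf)           # pairwise VSM similarities

# LDA at two granularities (topic counts are assumed; kept tiny so the
# toy corpus still fits -- the thesis would use larger counts).
counts = CountVectorizer().fit_transform(corpus)
topic_reprs = []
for k in (2, 3):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    theta = lda.fit_transform(counts)        # document-topic distributions
    topic_reprs.append(theta)

# Semantic similarity from each granularity's topic distributions.
topic_sims = [cosine_similarity(theta) for theta in topic_reprs]
print(vsm_sim[0, 1], [s[0, 1] for s in topic_sims])

In the multi-granularity view, the topic-level similarities obtained at the different granularities supplement, rather than replace, the surface-level similarity; the two methods described next build on this idea.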
The first method improves the existing single-granularity topic model approach to short text similarity by using topic models of multiple granularities: LDA models with different numbers of topics are trained on the short text training set, the trained models are used to analyze the topic components of short texts in the test set, and if two short text fragments have similar topic components they are judged to be semantically associated; this degree of association is then used to raise the similarity value of the two fragments. The second method expands the feature words of the original short text fragments and combines this with the multi-granularity similarity calculation above to further improve accuracy: as in the first method, the training set is modeled and the topic components of the test set fragments are analyzed, and for each fragment the few topic components it matches most strongly are attached to it as topic tags, increasing its number of feature words; if two fragments have similar topic components they receive the same topic tags, which raises their computed similarity.

Experimental results show that the proposed methods effectively improve short text classification performance. On the BuyAns dataset, compared with the classification performance of KNN and KNN_MTBS, the method combining short text feature expansion with multi-granularity topics improves average accuracy by about 4.1%; on Phan's question classification dataset, the proposed methods likewise outperform KNN and KNN_MTBS, with the combined feature expansion and multi-granularity method improving average accuracy by about 5.1%.
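The second improvement (topic-tag feature expansion feeding a KNN classifier) can be sketched roughly as below: each short text's strongest topics under LDA models of several granularities are appended as pseudo-words before vectorization, so texts that share topics also share tags. The pipeline, toy data, topic counts, and scikit-learn calls are assumptions for illustration, not the thesis's actual system.

# Assumed pipeline sketch: expand short texts with multi-granularity
# topic tags, then classify with KNN over the expanded features.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["stock market falls today", "team wins football final",
               "shares rise after earnings", "player scores late goal"]
train_labels = ["finance", "sports", "finance", "sports"]
test_texts = ["market rallies on earnings", "goal decides the final"]

def topic_tags(train, test, topic_counts=(2, 3), tags_per_model=1):
    """Pseudo-word tags naming each text's top topics under LDA models
    of different granularities (topic counts here are toy values)."""
    vec = CountVectorizer().fit(train)
    tr, te = vec.transform(train), vec.transform(test)
    tr_tags, te_tags = [[] for _ in train], [[] for _ in test]
    for k in topic_counts:
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(tr)
        for tags, theta in ((tr_tags, lda.transform(tr)), (te_tags, lda.transform(te))):
            for i, row in enumerate(theta):
                for t in np.argsort(row)[::-1][:tags_per_model]:
                    tags[i].append(f"topic{k}_{t}")   # shared tag = shared topic
    return tr_tags, te_tags

tr_tags, te_tags = topic_tags(train_texts, test_texts)
train_expanded = [t + " " + " ".join(g) for t, g in zip(train_texts, tr_tags)]
test_expanded = [t + " " + " ".join(g) for t, g in zip(test_texts, te_tags)]

# KNN over the expanded feature space; texts sharing topic tags move closer.
tfidf = TfidfVectorizer().fit(train_expanded)
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(tfidf.transform(train_expanded).toarray(), train_labels)
print(knn.predict(tfidf.transform(test_expanded).toarray()))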
Keywords/Search Tags: multi-granularity topic model, short text similarity, feature expansion, KNN