A Short Text Similarity Calculation Method Based On Feature Extension Using BTM Topic Model

Posted on: 2015-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2268330428468667
Subject: Computer technology
Abstract/Summary:
With the development of the Internet and mobile terminals, processing text information by computer has become increasingly important. The popularity of SMS, Twitter, and e-commerce means that more and more information is expressed in short texts, and the production of TB-level data indicates that the era of big data is coming. By mining short texts, for example extracting public opinion trends from simple texts or consumer psychology from product reviews, governments can follow public sentiment more closely and enterprises can better understand user needs. However, the feature sparsity of short texts makes such mining difficult. The main work of this paper is to reduce the sparsity of the short-text feature matrix.

For long texts, the development and application of topic models has matured, but work on short texts, which suffer from data sparsity, still cannot escape the shadow of long-text methods. Quite a lot of papers expand short texts into long texts using related information and then apply a topic model for modeling and calculation. Because related information is hard to find for some short texts, this approach is not general. This paper uses the BTM topic model, proposed at the IW3C2 conference in May 2013, to expand the feature matrix of short texts, and then uses the expanded feature matrix to calculate text similarity. Experiments show that this method performs well.

This paper first introduces the principle of the vector space model (VSM) and how to use VSM to calculate the similarity of short texts. It then briefly describes three text similarity formulas and compares cosine similarity with JS distance experimentally, finally choosing JS distance to calculate similarity. It then briefly introduces the theory and development of topic models.
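The two similarity measures compared above can be illustrated with a minimal sketch. It assumes each short text is already represented as a VSM vector of non-negative term weights over a shared vocabulary. The function names are hypothetical, and the sketch computes the Jensen-Shannon divergence with base-2 logarithms; whether the thesis uses the divergence itself or its square root as the "JS distance" is not stated in the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_u * norm_v)


def normalize(v):
    """Turn non-negative weights into a probability distribution."""
    total = sum(v)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / total for x in v]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(u, v):
    """Jensen-Shannon divergence between two weight vectors.

    With base-2 logarithms the value lies in [0, 1]:
    0 for identical distributions, 1 for disjoint ones.
    """
    p, q = normalize(u), normalize(v)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    doc_a = [2, 1, 0, 0]  # toy term counts over a 4-word vocabulary
    doc_b = [1, 1, 1, 0]
    print(cosine_similarity(doc_a, doc_b))
    print(js_divergence(doc_a, doc_b))
```

Note that cosine is a similarity (larger means closer) while JS divergence is a dissimilarity (smaller means closer), so the two are compared by the rankings they induce, not by raw values.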
The paper briefly presents the theory of the LDA topic model, its parameter estimation method, and the input and output of GibbsLDA, and then presents in more detail the theory of the BTM topic model, its parameter estimation method, and its input and output. In this part, an experiment compares the performance of the two topic models in similarity calculation, and the result shows that BTM performs better. Unlike traditional modeling methods, BTM learns topics by directly modeling the generation of word co-occurrence patterns over the whole corpus. Because it uses patterns aggregated over the whole corpus, BTM avoids the problem of sparse word co-occurrence patterns at the document level.

Finally, this paper proposes using the BTM topic model to reduce the sparsity of short-text features and then using the improved matrix to calculate short-text similarity. First, the BTM topic model is used to infer the document-topic and topic-word probability distributions. Next, these distributions are used to expand the feature matrix of the short texts. Last, the expanded feature matrix and JS distance are used to calculate text similarity. The method is evaluated on a collection of 22,000 questions crawled from a popular Chinese Q&A website, using the KNN classification algorithm in Weka, and the results show that it performs well.
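The three steps above can be sketched as follows. The abstract does not give the exact expansion formula, so this is only one plausible reading, under stated assumptions: the document-topic distribution `theta` and the topic-word distributions `phi` are taken as already inferred by a BTM implementation, and the sparse term vector is mixed with the topic-implied word distribution sum over z of P(z|d) P(w|z). The function names and the mixing weight `lam` are hypothetical and do not come from the thesis.

```python
import math


def expand_features(tf, theta, phi, lam=0.5):
    """Expand a sparse term-count vector with topic-model information.

    tf    : term counts of one short text, length V
    theta : document-topic distribution P(z|d), length K (assumed given)
    phi   : topic-word distributions P(w|z), K rows of length V (assumed given)
    lam   : hypothetical mixing weight between observed and inferred words
    """
    total = sum(tf)
    if total == 0:
        raise ValueError("document has no terms")
    observed = [c / total for c in tf]
    # Word distribution implied by the topics of this document.
    inferred = [
        sum(theta[z] * phi[z][w] for z in range(len(theta)))
        for w in range(len(tf))
    ]
    return [(1 - lam) * o + lam * t for o, t in zip(observed, inferred)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Toy data: 4-word vocabulary, 2 topics (values are illustrative only).
    phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
    doc_a = expand_features([1, 0, 0, 0], [1.0, 0.0], phi)
    doc_b = expand_features([0, 1, 0, 0], [1.0, 0.0], phi)
    # The raw vectors share no word; after expansion they overlap.
    print(js_divergence(doc_a, doc_b))
```

In this toy setting two texts that share a topic but no word move closer after expansion, which is the effect the method relies on to counter sparsity.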
Keywords/Search Tags: BTM topic model, similarity of short text, VSM, feature extension