Short Text Topic Mining Based On W-BTM And Text Classification Application

Posted on: 2018-10-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Zhang
GTID: 2348330512498795
Subject: Management Science and Engineering
Abstract/Summary:
With the development of the Internet and social networking sites of all kinds, large volumes of unstructured information, chiefly text, have emerged. Mining information from online text is increasingly important, but complex semantics also make it increasingly difficult to find valuable information. The sparsity and incompleteness of short texts pose new challenges for text mining, so research on text mining has gradually turned toward short texts. BTM is a topic model designed for short text mining, and it handles sparse and incomplete data well. However, existing text mining models, including BTM, have no dedicated parameter for handling stopwords. They simply load a stopword list during data preprocessing and delete the listed words. This approach is unscientific because corpora differ from one another: for each corpus, the stopwords that reflect its own characteristics should be identified.

Based on these considerations of features and stopwords, this paper proposes the W-BTM topic model. The paper uses the difference coefficient to represent the weight of each word in the text, which is called the weight model, and then introduces this weight as a parameter of BTM to form the W-BTM model. In this way, both the sparsity of short texts and the influence of stopwords are addressed. The model estimates its parameters by Gibbs sampling: values of the latent variables are sampled and then used to estimate the posterior parameters.

Finally, the paper applies the W-BTM topic model to book synopsis data collected from Dangdang. In the experiment, the short texts are classified with a support vector machine using the results obtained from the topic models, and the comparison of the classification results demonstrates the superiority of the W-BTM model.

The W-BTM model rests on the premise that the weight of each word in the entire document is known. In this setting, the "biterm" is used to determine whether a word is a stopword, which eliminates the effect of inappropriate stopwords on the accuracy of text mining. To verify the validity and soundness of the W-BTM topic model, the experiment evaluates the results from two aspects, topic mining and text categorization, in comparison with the LDA and BTM models. The results show that W-BTM performs better than both LDA and BTM.

The innovations are in three aspects:
(1) For stopword processing, this paper proposes a weight model in place of selecting a stopword list and removing the listed words directly. As a result, text classification becomes more principled and more accurate.
(2) By combining the weight model with the BTM topic model, W-BTM not only addresses the data sparseness of short texts but also compensates for the shortcomings of stopword handling in data preprocessing.
(3) This paper collected book description data from the Dangdang website and applied the W-BTM model to it for topic mining, so the model has practical significance. In the process, the paper dealt with the imbalance of the data, used W-BTM to mine topics, and classified the document-topic matrix with a support vector machine. The results verify the usability of the W-BTM model and show that it outperforms LDA and BTM.
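
The idea of weighting words before forming biterms can be illustrated with a minimal Python sketch. The abstract does not define the difference coefficient or state how the weight enters BTM, so two assumptions are made here purely for illustration: the weight is taken to be the coefficient of variation (standard deviation divided by mean) of a word's per-document count, and a biterm's weight is taken to be the product of its two word weights. The function names are hypothetical and this is not the thesis's implementation.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def word_weights(docs):
    """ASSUMPTION: weight = coefficient of variation of per-document counts.

    A word spread evenly over all documents (stopword-like) gets a weight
    near 0; a word concentrated in few documents gets a larger weight.
    """
    vocab = sorted({w for doc in docs for w in doc})
    weights = {}
    for w in vocab:
        counts = [doc.count(w) for doc in docs]
        mu = mean(counts)
        weights[w] = pstdev(counts) / mu if mu > 0 else 0.0
    return weights


def biterms(doc):
    """Unordered word pairs co-occurring in one short text (the BTM unit)."""
    return [tuple(sorted(pair)) for pair in combinations(doc, 2)]


def weighted_biterm_counts(docs):
    """ASSUMPTION: a biterm contributes the product of its word weights."""
    weights = word_weights(docs)
    totals = Counter()
    for doc in docs:
        for a, b in biterms(doc):
            totals[(a, b)] += weights[a] * weights[b]
    return weights, totals


# Toy corpus of tokenized short texts (made up for this sketch).
docs = [
    ["the", "python", "code"],
    ["the", "novel", "plot"],
    ["the", "python", "book"],
]
weights, totals = weighted_biterm_counts(docs)
```

In this toy corpus "the" appears once in every document, so its weight is 0 and every biterm containing it contributes nothing, without any stopword list being loaded. A full topic model would then run Gibbs sampling over the weighted biterms, which this sketch does not attempt.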
Keywords/Search Tags:W-BTM topic model, topic mining, short text, text classification