Research On Microblog Topic Sequential Feature Extraction Algorithm Based On LDA-WO Mixed Model

Posted on: 2019-04-23  Degree: Master  Type: Thesis
Country: China  Candidate: M T Qiu
GTID: 2417330596950289  Subject: Management Science and Engineering
Abstract/Summary:
In the Web 2.0 era, the explosion of Internet data produces more than 100 million new microblog posts every day, far beyond what any individual can process. Faced with this volume of text, it becomes necessary to study how to extract useful information quickly and accurately, that is, topic information extraction. The LDA (Latent Dirichlet Allocation) topic model performs well on topic extraction from microblog text and is widely used, but it has two shortcomings: (1) the model ignores the fact that different words differ in their ability to discriminate between topics, which leads to inaccurate results; (2) the output of topic modeling is unordered and hard to interpret, which makes it difficult for users to infer the content of the corpus from the topics. This thesis therefore proposes a new microblog topic feature extraction algorithm that improves the accuracy and readability of the extraction results.

First, the related theories of information extraction, topic models and word order are reviewed and summarized, and the relevant parts are selected as the research foundation. Then, to address the inaccurate extraction results of the LDA model, the thesis extends the LDA model to account for the different ability of different words to discriminate between topics. Next, to address the poor readability of the feature words extracted by the LDA model, the thesis constructs the WO (word order) model based on word order theory and the linguistic diagram model, sorts the extracted feature words to obtain more readable, ordered feature words, and proposes an OPMI algorithm based on the idea of word co-occurrence, which expresses each topic as ordered feature phrases. The extended LDA model is then combined with the WO model to build an LDA-WO hybrid model that improves the readability of the extracted results. Finally, real data are used to verify the effectiveness of the algorithm.

The specific innovations are as follows: (1) To address the inaccurate extraction results of the LDA model, and considering the different ability of different words to discriminate between topics, the thesis constructs the KWFP-ITP algorithm, based on the TF-IDF algorithm, to extend the LDA model. (2) To address the poor readability of the feature words extracted by the LDA model, the thesis constructs the WO (word order) model based on word order theory and the linguistic diagram model, sorts the extracted feature words to obtain more readable, ordered feature words, and expresses each topic as ordered feature phrases based on the idea of word co-occurrence.
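
The abstract does not define the OPMI algorithm, so the following is only a minimal sketch of the general idea it points to: ordering a topic's feature words by directional word co-occurrence. It assumes an order-sensitive pointwise mutual information over adjacent word pairs, and the function names (ordered_pmi, order_feature_words) and the unseen-pair penalty are hypothetical choices made for illustration, not the thesis's method.

```python
import math
from collections import Counter
from itertools import permutations


def ordered_pmi(docs, unseen=-10.0):
    """Build a directional PMI scorer from tokenized documents.

    pmi(a, b) is high when word a is immediately followed by word b
    more often than chance; pairs never seen in that order get the
    fixed penalty `unseen` (an assumption made for this sketch).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        # Ordered adjacent pairs: (a, b) and (b, a) are counted separately.
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        joint = bigrams[(a, b)]
        if joint == 0:
            return unseen
        p_ab = joint / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        return math.log(p_ab / (p_a * p_b))

    return pmi


def order_feature_words(words, pmi):
    """Return the ordering of `words` with the highest summed adjacent PMI.

    Exhaustive search over permutations, so only suitable for the
    handful of top words of a single topic.
    """
    def score(seq):
        return sum(pmi(a, b) for a, b in zip(seq, seq[1:]))

    return max(permutations(words), key=score)
```

In use, the unordered top words of one topic (for example from an LDA run) would be passed to order_feature_words together with a scorer built from the tokenized microblog corpus. The exhaustive search grows factorially with the number of words, so a longer word list would need a greedy or beam search instead.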
Keywords/Search Tags: Weibo topic, LDA, WO, text mining, topic model