
Research And Implementation Of Multi-source Text Topic Detection Based On Fusion Clustering

Posted on: 2021-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Wang
GTID: 2428330632462686
Subject: Computer technology
Abstract/Summary:
In the era of big data, the Internet carries a large amount of information published through different information sources, such as news websites, online communities, and social media, and the characteristics of the information differ from one source to another. The information spread by these sources has become a major part of what people browse online and has produced many hot topics in daily life. Detecting hot topics on the Internet plays an important role in the supervision of public opinion by the relevant departments. Public opinion managers sometimes need to detect hot topics not from a single information source but from several categories at once, for example from a text set that mixes news websites, online communities, and social media. Existing hot topic detection technology is mainly applied to a single type of information source and is not suitable for this multi-source setting. In addition, the output of hot topic detection is a series of representative words, which public opinion managers sometimes find difficult to interpret.
To solve these problems, this paper designs and implements a multi-source text hot topic detection model based on fusion clustering. To address the feature differences between information sources, the model includes a multi-source text feature fusion method in which long texts are summarized with the TextRank algorithm and short texts are semantically expanded using the Harbin Institute of Technology synonym thesaurus. Because multi-source text features remain sparse even after feature fusion, a text clustering method based on the Dirichlet Multinomial Mixture model is designed and implemented to handle this sparsity. Passing information from different sources through this model yields a set of hot topics, each represented by three topic candidate words.
This paper also designs and implements a topic semantic representation model, based on the fusion of strategies and a deep learning sequence model, that generates topic labels that are fluent and close in meaning to the topic candidate words, so that public opinion managers can understand the real meaning of the hot topics more easily. In addition, this paper designs and implements a multi-source text hot topic detection system that makes the hot topic detection model and the topic semantic representation model easier for public opinion managers to use. The system provides data source crawling, persistent data storage, and hot topic visualization, which can help the government or relevant departments guide or intervene in public opinion and support their supervision work.
To verify the multi-source text hot topic detection model, comparative experiments are carried out on a multi-source text data set composed of the Fudan University Chinese data set and Sina Weibo data. The results show that the model clusters better than the reference clustering models. To verify the topic semantic representation model, comparative experiments use the Bilingual Evaluation Understudy (BLEU) score and manual evaluation. The results show that the model produces better semantic representations than the reference semantic representation models.
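The clustering step based on the Dirichlet Multinomial Mixture (DMM) model can be illustrated with a minimal sketch. The abstract does not say how the model is fitted, so the collapsed Gibbs sampling scheme below (commonly referred to as GSDMM) is an assumption, as are the function name `gsdmm`, the parameters `k`, `alpha`, `beta`, and `iters`, and their default values. The input is assumed to be already tokenized and already passed through feature fusion.

```python
import math
import random
from collections import Counter


def gsdmm(docs, k=8, alpha=0.1, beta=0.1, iters=30, seed=0):
    """Cluster tokenized documents with collapsed Gibbs sampling for DMM.

    docs is a list of token lists; k is an upper bound on the number of
    clusters. Returns one cluster label per document. Hypothetical helper,
    not the thesis implementation.
    """
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    m = [0] * k                          # documents per cluster
    n = [0] * k                          # tokens per cluster
    nw = [Counter() for _ in range(k)]   # word counts per cluster
    labels = []

    def move(doc, z, sign):
        # Add (sign=+1) or remove (sign=-1) a document from cluster z.
        m[z] += sign
        n[z] += sign * len(doc)
        for w in doc:
            nw[z][w] += sign

    for doc in docs:                     # random initial assignment
        z = rng.randrange(k)
        labels.append(z)
        move(doc, z, +1)

    for _ in range(iters):
        for i, doc in enumerate(docs):
            move(doc, labels[i], -1)     # take the document out of its cluster
            log_p = []
            for z in range(k):
                # Cluster popularity term.
                lp = math.log(m[z] + alpha)
                # Word similarity term; `seen` handles repeated words.
                seen = Counter()
                for j, w in enumerate(doc):
                    lp += math.log(nw[z][w] + beta + seen[w])
                    lp -= math.log(n[z] + vocab_size * beta + j)
                    seen[w] += 1
                log_p.append(lp)
            top = max(log_p)             # subtract the max for numerical stability
            weights = [math.exp(lp - top) for lp in log_p]
            labels[i] = rng.choices(range(k), weights=weights)[0]
            move(doc, labels[i], +1)
    return labels
```

Calling `gsdmm(token_lists, k=20)` returns one label per document. Because each document is assigned to exactly one cluster rather than to a mixture of topics, this family of models is often preferred for short, sparse texts; clusters that end up empty are simply unused, so `k` only needs to be an upper bound.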
Keywords/Search Tags:multi-source text, hot topic, fusion clustering, sequence model, semantic representation