Topic Detection Based On LDA Model And Density Clustering

Posted on: 2017-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: C Li
GTID: 2348330503480732
Subject: Computer application technology
Abstract/Summary:
In recent years, the rapid development of Internet technologies and the diversification of network terminals have caused the volume of online news to grow quickly and its structure to become more complex. Traditional methods of news collection, collation, and analysis struggle to detect the latent links between news items and, in turn, to determine how the news is developing from a global perspective. Topic detection technology was developed to address these problems by automatically detecting latent topics in large-scale news collections. It can also detect unexpected events and track their overall progress. Topic detection has been widely used in opinion monitoring, information security, trade finance, and other fields.

In this thesis, we study topic detection on large-scale news datasets. The main work is as follows:

(1) The proposed method combines the LDA model with a density-based clustering algorithm. The LDA model reduces the dimensionality of the data by representing each news item as a probability distribution over a set of topics, from which the most important features of the topics are then extracted. The density-based clustering algorithm, in turn, is more effective at mining the structure of the topics.

(2) The T-OPTICS algorithm is proposed to account for the temporal continuity of news topics. It builds on OPTICS, which is not sensitive to its parameters, so the influence of parameter choices on the clustering results is reduced. The computation of text similarity is also improved by incorporating time parameters.

(3) Based on the characteristics of topic detection, we propose an automatic cluster identification method based on reachability plots. The method rests on the idea that a topic is a set of events (or activities) related to a core event (or activity).
First, the method identifies every concave section of the reachability plot as an event and extracts the core features of each event. Finally, events with similar core features are merged into topics. The proposed method is shown to overcome the parameter sensitivity of other cluster identification methods.

The experimental results show that the proposed methods detect topics in the TDT4 dataset quickly and effectively.
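The first step of the cluster identification in (3), treating each concave section of the reachability plot as a candidate event, can be illustrated with a minimal sketch. This is not the thesis's actual procedure, which the abstract does not specify. It assumes that a concave section is simply the run of points between two consecutive local maxima of the reachability values. The function name `concave_sections` and the `min_size` filter are hypothetical and introduced only for illustration.

```python
import math


def concave_sections(reachability, min_size=2):
    """Split an OPTICS cluster ordering into valleys bounded by local maxima.

    reachability[i] is the reachability distance of the i-th point in the
    cluster ordering (math.inf for the first point). Returns a list of
    (start, end) index pairs with end exclusive.

    Assumption: a "concave section" is the run between two local maxima.
    min_size is a hypothetical filter that drops very short runs as noise.
    """
    n = len(reachability)
    boundaries = [0]
    for i in range(1, n):
        left = reachability[i - 1]
        right = reachability[i + 1] if i + 1 < n else -math.inf
        # A bar higher than its left neighbour and not lower than its right
        # neighbour is a peak; the point at the peak opens the next valley.
        if reachability[i] > left and reachability[i] >= right:
            boundaries.append(i)
    boundaries.append(n)

    sections = []
    for start, end in zip(boundaries, boundaries[1:]):
        if end - start >= min_size:
            sections.append((start, end))
    return sections


# Toy reachability plot with three valleys separated by peaks at 0.9 and 0.8.
reach = [math.inf, 0.2, 0.1, 0.15, 0.9, 0.3, 0.2, 0.25, 0.8, 0.1, 0.1]
print(concave_sections(reach))  # [(0, 4), (4, 8), (8, 11)]
```

Splitting at every local maximum deliberately over-segments the plot. That fits the description above, in which events are first identified individually and only later merged into topics by comparing their core features. The merging step is not sketched here.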
Keywords/Search Tags: topic detection, LDA model, OPTICS, cluster identification