
The Hot-topic Discovery Based On Density Clustering Of Feature Words And Similarity Calculation

Posted on: 2014-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: C J Han
Full Text: PDF
GTID: 2268330401466997
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet, network media has become an important source of information, and the guiding role of online public opinion has become more and more important. At the same time, Internet public opinion has become an important form of social intelligence. Although the large and varied body of online information plays an active role in the development of society, Internet public opinion arises widely and spreads fast, which creates information security problems. Harmful information, such as reactionary, obscene, and superstitious content, spreads over the Internet and seriously harms national security and social stability. How to discover Internet hot topics accurately and in a timely manner has therefore become an active research topic at home and abroad. Optimizing and improving the technology related to Internet hot-topic discovery makes it possible to track hot topics better and improves the efficiency and accuracy of discovery.

Internet hot-topic discovery mainly involves feature extraction and clustering algorithms. High-quality, high-speed text clustering divides large volumes of text into meaningful clusters. Researchers have paid much attention to clustering algorithms in recent years, such as ARHP, PDDP, K-means, PAM, DBSCAN and OPTICS. These algorithms can cluster texts well, but they have limitations with respect to keyword extraction and similarity calculation. This paper therefore formulates Internet hot-topic discovery as a problem of keyword extraction, density clustering, and similarity calculation, and optimizes keyword extraction and similarity calculation at the same time. To address the problems of existing keyword extraction methods and to optimize similarity calculation, a new improved algorithm based on title keyword extraction, an improved similarity calculation, and a related clustering algorithm has been designed to realize Internet hot-topic discovery. These algorithms focus on improving the accuracy of Internet hot-topic discovery. The main studies in this paper are as follows:

(1) The quality of keywords is closely related to the main points of an article's content. Keywords can be extracted effectively only when the content and the exact meaning of the words are fully understood. To obtain high-quality keywords, a method that extracts keywords from the title is presented.

(2) When discovering hot topics, whether a classification algorithm or a clustering algorithm is used, the similarity between two vectors must be analyzed so that it better reflects the real similarity of the texts. Based on the given similarity formula, a method that incorporates keyword weights into the similarity formula is presented.

(3) Combining the above methods, this paper proposes a hot-topic discovery algorithm based on title keywords, density clustering, and similarity calculation.

Finally, comparative experiments and analysis of the above algorithm show that it performs well.
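The three steps listed in points (1) to (3) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not give the weighting scheme or the similarity formula, so the sketch assumes a plain term-frequency vector with a fixed extra weight for title words (`TITLE_BOOST`), ordinary cosine similarity over those weighted vectors, and a simplified DBSCAN-style clustering that treats "similarity at least `min_sim`" as the neighborhood test. The stopword list, the function names, and all parameter values are hypothetical.

```python
import math
from collections import Counter

# Assumptions for illustration only; none of these values come from the thesis.
STOPWORDS = {"the", "a", "of", "in", "on", "and", "to", "for"}
TITLE_BOOST = 2.0  # extra weight added for each occurrence of a word in the title


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def weighted_vector(title, body):
    """Term-frequency vector in which title words receive extra weight."""
    vec = Counter(tokenize(body))
    for word in tokenize(title):
        vec[word] += TITLE_BOOST
    return dict(vec)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse weighted vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def density_cluster(vectors, min_sim=0.5, min_pts=2):
    """DBSCAN-style clustering; returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    # Neighborhood of i: all points (including i) at least min_sim similar to it.
    neighbors = [
        [j for j in range(n)
         if cosine_similarity(vectors[i], vectors[j]) >= min_sim]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # not dense enough; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point; keep expanding
    return labels


# Made-up (title, body) pairs used only to exercise the sketch.
docs = [
    ("Earthquake hits coastal city", "rescue teams search earthquake rubble in coastal city"),
    ("Coastal city earthquake rescue", "earthquake rescue continues coastal city residents"),
    ("Football final tonight", "fans gather for football final match tonight"),
    ("Tonight football final preview", "football final preview team lineup tonight"),
    ("Recipe for noodle soup", "boil noodle add broth serve soup"),
]
vectors = [weighted_vector(title, body) for title, body in docs]
labels = density_cluster(vectors)
print(labels)
```

In this toy run the two earthquake stories and the two football stories form separate clusters and the recipe is left as noise. A hot topic could then be taken as a cluster with many members, although the abstract does not state how the thesis ranks clusters.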
Keywords/Search Tags: Hot-topic Discovery Problem, Title Keywords Extracting, Keywords Weight, Similarity Calculation, Density Clustering Algorithm