The volume of information on the network is enormous, it comes from many different sources, and it is updated continuously, so it is difficult for people to find the information that interests them. This dissertation extracts keywords with an improved TF-IDF function and then clusters network news, which helps users quickly find hot information in massive collections of electronic text. Taking into account a document category factor, a location weight factor, and a named entity weight factor, we improve the traditional TF-IDF function and design a keyword extraction flow for documents based on the improved function. Experimental results show that, compared with the traditional TF-IDF function, the improved function based on category, location weight, and named entities raises the precision of keyword extraction by about 13.3% and the recall by about 13.1%. We then use the improved function to extract keywords from the background corpus and discover the hot topics of the test corpus by text clustering; the difference in effect between the traditional and the improved TF-IDF functions is remarkable. Experimental analysis shows that the precision and recall of hot topic discovery improve by about 10% when the improved TF-IDF function is used to extract features. This work can be widely applied to hot topic tracking.
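
The abstract does not state the improved formula, so the following is only a minimal sketch of how category, location, and named entity factors might be combined with a classic TF-IDF score. Everything specific here is an assumption made for illustration: the multiplicative combination, the smoothed IDF, the hypothetical boost values `title_boost` and `entity_boost`, the category factor defined as one plus the share of containing documents in the same category, and the document layout (a dict with `title`, `body`, `category`, and `entities`).

```python
import math
from collections import Counter


def weighted_tfidf(doc, corpus, title_boost=2.0, entity_boost=1.5):
    """Score every term of `doc` with a factor-weighted TF-IDF.

    Each document is a dict with token lists 'title' and 'body', a string
    'category', and a set 'entities'. The three extra factors and their
    values are illustrative assumptions, not the dissertation's formula.
    """
    tokens = doc["title"] + doc["body"]
    counts = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / len(tokens)
        containing = [d for d in corpus if term in d["title"] + d["body"]]
        # Smoothed IDF so that no term gets a zero or negative weight.
        idf = math.log((1 + n_docs) / (1 + len(containing))) + 1.0
        # Category factor (assumed): terms concentrated in the document's
        # own category score higher than terms spread over all categories.
        same_cat = sum(1 for d in containing if d["category"] == doc["category"])
        category = 1.0 + same_cat / max(len(containing), 1)
        # Location factor (assumed): terms that appear in the title count more.
        location = title_boost if term in doc["title"] else 1.0
        # Named entity factor (assumed): recognized entities count more.
        entity = entity_boost if term in doc["entities"] else 1.0
        scores[term] = tf * idf * category * location * entity
    return scores


def top_keywords(doc, corpus, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    scores = weighted_tfidf(doc, corpus)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

With all three factors set to 1.0, the score reduces to a plain smoothed TF-IDF, which makes it easy to compare the traditional and the weighted rankings on the same corpus.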