| Data acquisition and topic analysis are key technologies for online public opinion analysis and have become a hot spot in intelligent information processing. The technology collects online public opinion data from the Internet, detects topics, and mines those topics in depth and from multiple perspectives by using a data cube model of online public opinion topics. This deep mining can faithfully describe how online public opinion develops and changes, which is significant for information supervision and information security. This dissertation studies data acquisition and topic analysis of online public opinion, including data collection, data extraction, topic detection and topic analysis. The major contributions are as follows: (1) Specialized web crawlers are designed and implemented for collecting online public opinion data. On the one hand, to increase download speed, the basic framework of the general web crawler is improved with asynchronous sockets, a DNS cache and multiple queues, based on an analysis of the general crawler's shortcomings. On the other hand, based on an analysis of the main ways in which online public opinion spreads, this dissertation focuses on web news, web forums and web blogs, and designs a specialized crawler for each type of target site. Each crawler adopts a strategy suited to its sites and supports script execution and RSS parsing to achieve precise crawling. Experimental results show that the specialized web crawlers outperform the general web crawler in both efficiency and accuracy. (2) A novel data extraction method based on page layout similarity is proposed. The method processes web pages at two levels without manual work: topic blocks are first recognized at the page level, and metadata is then extracted from the topic blocks at the area level by using their statistical characteristics.
Experimental results show that this method performs well in adjustability, precision and recall. (3) To address the high complexity and low accuracy of current topic detection algorithms, a new method based on hierarchical clustering is presented for topic detection in online public opinion. First, hierarchical clustering is applied to an initial set of documents to obtain a set of topics. Second, each subsequent document is matched against these topics. Finally, the first step is run again on the documents that were not assigned to any topic. Experimental results show that, compared with traditional methods, this method is efficient, with high accuracy and low miss and false-alarm rates. (4) A data cube model of online public opinion topics is put forward using data warehouse technology. The data cube model contains the major components of online public opinion and can be easily extended to meet practical needs. Experimental results show that in-depth mining from multiple perspectives can be performed on the data cube model. The analysis results can faithfully describe how online public opinion develops and changes, which helps people understand an online public opinion topic comprehensively and provides the information needed for online public opinion early warning. |
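
The three-step topic detection procedure in contribution (3) can be illustrated with a minimal sketch. The abstract does not give the document representation, the similarity measure, the linkage rule or the threshold, so everything below is an assumption made for illustration: bag-of-words term counts, cosine similarity, centroid-style merging, a threshold of 0.3, and the hypothetical function names `vectorize`, `cosine`, `agglomerate` and `detect_topics`. It is not the dissertation's implementation.

```python
from collections import Counter
from math import sqrt


def vectorize(text):
    """Bag-of-words term counts (an assumed, deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def agglomerate(vectors, threshold):
    """Bottom-up clustering: repeatedly merge the most similar pair of clusters
    until no pair reaches the threshold. Each cluster is kept as a summed term
    vector, which has the same direction as its centroid."""
    clusters = [Counter(v) for v in vectors]
    members = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i], clusters[j])
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        members[i] += members[j]
        del clusters[j], members[j]  # j > i, so index i stays valid
    return clusters, members


def detect_topics(docs, seed_size, threshold=0.3):
    """Return topics as lists of document indices."""
    vecs = [vectorize(d) for d in docs]
    # Step 1: hierarchical clustering on the first seed_size documents.
    topics, members = agglomerate(vecs[:seed_size], threshold)
    # Step 2: match each following document against the existing topics.
    unassigned = []
    for idx in range(seed_size, len(vecs)):
        sims = [cosine(vecs[idx], t) for t in topics]
        best = max(range(len(topics)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            members[best].append(idx)
        else:
            unassigned.append(idx)
    # Step 3: cluster the documents that matched no topic to form new topics.
    if unassigned:
        _, groups = agglomerate([vecs[i] for i in unassigned], threshold)
        members += [[unassigned[k] for k in group] for group in groups]
    return members
```

In this sketch the topic vectors stay fixed during step 2. Whether the original method updates topic representations as documents are assigned is not stated in the abstract.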