| Micro-blog hot topic discovery extracts topics from a large volume of micro-blog posts and identifies the hot ones using a topic heat evaluation method. It helps people pick out the information they are interested in or need from a large amount of information, and it also plays an important role in government public opinion guidance, information security, financial judgment, and other fields. This paper analyzes and summarizes the current state of research on micro-blog hot topics and identifies the existing problems: the error rate of text segmentation is high, the accuracy of topic word extraction is low, and the choice of hot topic evaluation method varies between studies. To address these problems, this paper focuses on the following aspects. Firstly, Chinese word segmentation and new word discovery are studied in detail. It is found that current word segmentation tools produce many word fragments, especially when segmenting new words, so that the result differs greatly from the intended meaning. To reduce this high error rate, we propose a new method based on rules and N-gram models. The method first applies word structure rules to build a library of word fragments, then uses Bi-gram and Tri-gram models to extract candidate strings from the fragment library, selects from the candidates produced by the two models those with a higher probability of being words, and finally combines word segmentation with the discovered new words. The experimental results show that this algorithm effectively improves the segmentation of new words in micro-blog text. Secondly, to address the low accuracy of topic word extraction, this paper combines the advantages of the TF-IDF algorithm and the word co-occurrence model and proposes a keyword extraction algorithm based on an optimized TF-IDF algorithm and the word co-occurrence model. The traditional TF-IDF algorithm is found not to reflect the position information of a word. To reflect the importance of words effectively, the position at which each word occurs (the micro-blog body text, the title, or a comment) is recorded in the data, and each position is given a different weight to optimize the TF-IDF algorithm. On this basis, the word co-occurrence model is used to extract the topic words. The experimental results show that this algorithm reduces the deviation in keyword extraction and makes the results more accurate. Thirdly, based on a study of the structure of micro-blog and the law of topic propagation, this paper selects user characteristics and topic words as the factors that influence hot topics, uses them to design a formula for the heat value of each topic, and finally selects the micro-blog hot topics according to a heat value threshold. The experimental results show that the hot topics selected by this algorithm are more consistent with the actual hot topics on micro-blog. |
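
The position-weighted TF-IDF idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the weight values in `POSITION_WEIGHTS`, the function name `weighted_tfidf`, the input layout, and the smoothed IDF formula are all assumptions made for illustration, since the abstract does not give the actual weights or formula.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state the real values.
POSITION_WEIGHTS = {"title": 3.0, "text": 1.0, "comment": 0.5}


def weighted_tfidf(docs):
    """Score words with a position-weighted TF-IDF.

    docs: list of dicts mapping a position name ("title", "text",
    "comment") to a list of already-segmented tokens.
    Returns one {word: score} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each word appears.
    df = Counter()
    for doc in docs:
        vocab = set()
        for tokens in doc.values():
            vocab.update(tokens)
        df.update(vocab)

    results = []
    for doc in docs:
        # Each occurrence contributes the weight of the position it is in.
        weighted_tf = Counter()
        for position, tokens in doc.items():
            weight = POSITION_WEIGHTS.get(position, 1.0)
            for token in tokens:
                weighted_tf[token] += weight

        total = sum(weighted_tf.values()) or 1.0
        scores = {}
        for word, count in weighted_tf.items():
            # Smoothed IDF (an assumption, chosen to avoid division by zero).
            idf = math.log((1 + n_docs) / (1 + df[word])) + 1.0
            scores[word] = (count / total) * idf
        results.append(scores)
    return results


posts = [
    {"title": ["earthquake"],
     "text": ["earthquake", "rescue", "today"],
     "comment": ["today"]},
    {"title": ["weather"],
     "text": ["today", "sunny"],
     "comment": []},
]
scores = weighted_tfidf(posts)
top_word = max(scores[0], key=scores[0].get)
```

In this sketch a word that appears in the title counts more than the same word in a comment, so title words tend to rank higher. The word co-occurrence step that the paper applies afterwards is not shown.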