| As one of the most typical forms of Web 2.0, the blog has become an important platform for sharing information and expressing opinions as internet technology has developed. It combines personal space with social space and plays an important role in people's political, economic and cultural life. It is therefore worth studying how to find, organize, retrieve and make effective use of the rich resources of blogs, and how to mine valuable information from them. Blogs can be divided into two types: portal website blogs and individual independent blogs. For both types, the number of recommended bloggers is usually very limited because of the low exposure rate, and most blogs reach readers only through search engines. This cannot meet internet users' need for real-time and comprehensive access to blog-related information. This thesis mainly focused on mining information about portal website bloggers, their friends and their visitors. It also designed and implemented a classification-based algorithm for identifying blog home pages, covering portal blogs, fake blogs and individual independent blogs. (1) The thesis established a blog discovery system for portal websites. Because Ajax (Asynchronous JavaScript and XML) is widely used in blogs, a traditional Web crawler retrieves far fewer links than are actually shown on the page. The system contained a focused crawler that solved this problem by obtaining the Ajax data on the portal websites. (2) The thesis designed and implemented a blog page recognition algorithm, which treated the recognition of blog home pages as a classification problem.
For portal blog home pages, fake blog home pages and individual independent blog home pages, the thesis extracted many kinds of features closely related to blogs, such as HTML, URL and text features, DOM tree depth and anchor text content. On this basis, the thesis analyzed how three typical classification algorithms perform in identifying blog home pages. The thesis established an automatic blog discovery system and evaluated its performance. Experimental results showed that, when a blog website used Ajax, the designed system mined bloggers more effectively than a traditional crawler. Based on an analysis of the set of blog pages obtained by the blog discovery system, the thesis designed and implemented a blog home page classification system, and compared the Naive Bayes, Decision Tree and SVM algorithms. The SVM algorithm obtained the best performance. The precision, recall and Micro-F1 for blog home pages reached 98%, 95% and 96% respectively. |
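
The feature families named above (URL, DOM tree depth, anchor text) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class `PageStats`, the function `extract_features`, the specific feature names, and the keyword choices (`"blog"`, `"archive"`) are all hypothetical, chosen only to show how a page might be turned into a feature dictionary before being passed to a classifier such as SVM.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Void elements have no closing tag, so they are excluded from depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PageStats(HTMLParser):
    """Hypothetical helper: records maximum DOM depth and anchor text."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.in_anchor = False
        self.anchor_texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        if tag == "a":
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_anchor and text:
            self.anchor_texts.append(text.lower())


def extract_features(url, html):
    """Return an illustrative feature dict; the feature set is an assumption."""
    path_segments = [s for s in urlparse(url).path.split("/") if s]
    stats = PageStats()
    stats.feed(html)
    stats.close()
    anchor_words = " ".join(stats.anchor_texts)
    return {
        "url_has_blog": int("blog" in url.lower()),          # URL feature
        "url_path_depth": len(path_segments),                # URL feature
        "dom_max_depth": stats.max_depth,                    # DOM tree depth
        "anchor_count": len(stats.anchor_texts),             # anchor text
        "anchor_mentions_archive": int("archive" in anchor_words),
    }


if __name__ == "__main__":
    sample = ("<html><body><div><a href='/archive'>Archive</a><br>"
              "<p>Hello</p></div></body></html>")
    print(extract_features("http://blog.example.com/user/alice", sample))
```

A dictionary of this kind would typically be vectorized and fed to the classifiers being compared; the features and thresholds actually used in the thesis are not given in this abstract.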