Blog Auto Discovery Approach

Posted on:2011-08-04

Degree:Master

Type:Thesis

Country:China

Candidate:Y Bi

Full Text:PDF

GTID:2178330338489566

Subject:Computer Science and Technology

Abstract/Summary:

As one of the most typical forms of Web 2.0, the blog has become an important platform for sharing information and expressing opinions as internet technology has developed. It combines personal space with social space and plays an important role in people's political, economic, and cultural life. It is therefore worth studying how to find, organize, retrieve, and make effective use of the rich resources of blogs, and how to mine valuable information from them.

Blogs can be divided into two types: blogs hosted on portal websites and independent personal blogs. For both types, the number of recommended bloggers is usually very limited because of their low exposure, and most blogs reach readers only through search engines. This cannot meet users' needs for timely and comprehensive blog information. This thesis focuses on mining information about portal website bloggers, their friends, and their visitors. It also designs and implements a classification-based algorithm for identifying blog home pages, covering portal blogs, fake blogs, and independent personal blogs.

(1) The thesis builds a blog discovery system for portal websites. Because Ajax (Asynchronous JavaScript and XML) is widely used in blogs, a traditional Web crawler retrieves far fewer links than the page actually displays. The system includes a focused crawler that addresses this problem by fetching the Ajax data on portal websites.

(2) The thesis designs and implements a blog page recognition algorithm, which treats blog home page recognition as a classification problem. For portal blog home pages, fake blog home pages, and independent personal blog home pages, it extracts several kinds of blog-related features, including HTML, URL, and text features, DOM tree depth, and anchor text content. On this basis, it analyzes how three typical classification algorithms perform in identifying blog home pages.

The thesis builds the blog auto discovery system and evaluates its performance. Experimental results show that, when a blog website uses Ajax, the system is better able to mine bloggers than a traditional crawler. Based on an analysis of the blog pages collected by the discovery system, the thesis designs and implements a blog home page classification system and compares the Naive Bayes, Decision Tree, and SVM algorithms. SVM performs best: precision, recall, and Micro-F1 for blog home pages reach 98%, 95%, and 96% respectively.
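To make the feature families named in the abstract (URL, HTML, DOM tree depth, anchor text) more concrete, here is a minimal Python sketch that turns one page into a small feature dictionary. It is an illustration only, not the thesis's feature set: the names `PageStats`, `extract_features`, `BLOG_HINTS`, and every individual feature are assumptions chosen for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Void elements have no closing tag, so they must not change the depth counter.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical keyword list; the abstract does not give the actual vocabulary.
BLOG_HINTS = ("blog", "post", "archive", "comment")


class PageStats(HTMLParser):
    """Collects maximum DOM depth, link count, and anchor-text words."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.link_count = 0
        self.in_anchor = False
        self.anchor_words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        if tag == "a":
            self.link_count += 1
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        # Only text inside <a>...</a> counts as anchor text.
        if self.in_anchor:
            self.anchor_words.extend(data.lower().split())


def extract_features(url, html):
    """Return a dict of illustrative URL, DOM, and anchor-text features."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    path_segments = [s for s in urlparse(url).path.split("/") if s]
    return {
        "url_has_blog": int("blog" in url.lower()),
        "url_path_depth": len(path_segments),
        "dom_max_depth": stats.max_depth,
        "link_count": stats.link_count,
        "anchor_hint_count": sum(w in BLOG_HINTS for w in stats.anchor_words),
    }
```

A dictionary like this could be passed to any standard classifier after vectorization. The depth counter assumes reasonably well-formed HTML; real blog pages with unclosed tags would need a more tolerant parser.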
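As a quick consistency check on the reported figures, the short Python sketch below computes F1 as the harmonic mean of precision and recall. It assumes that the 98% precision and 95% recall refer to the same class (blog home pages); a Micro-F1 pooled over several classes is computed from summed counts and need not match this single-class value. The function name `f1_score` is chosen for the example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures from the abstract, written as fractions.
f1 = f1_score(0.98, 0.95)
print(f"{f1:.3f}")  # prints 0.965
```

Under that assumption, the result is in line with the 96% stated in the abstract.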

Keywords/Search Tags:

blog discovery, Ajax, blog identification, SVM algorithm

Related items

1	The Design And Implementation Of Blog Management System Of Campus Based On Ajax
2	The Design And Implementation Of Blog Management System Of Campus Based On AJAX
3	Research And Implementation Of Ajax Technology In The Blog System
4	The Application Of Ajax Technology In Blog System
5	An Enterprise Edition Blog System Applications
6	Design And Implementation Of Campus Micro Blog System
7	Research On Blog Friends Recommendation Mechanism In Blogsphere
8	Research On Blog Friends Recommendation Mechatism In Blogsphere
9	Discovery Of Implied Communities For Blog Page Based On Topic
10	Realization Of Blog Information Collection And The Calculation Of Support Degree Of Hot Topic