Research On Vision Based Algorithm In Chinese Web-Page Classification

Posted on: 2008-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: X Duan
Full Text: PDF
GTID: 2178360212994644
Subject: Computer system architecture
Abstract/Summary:
The Vision-based Page Segmentation (VIPS) algorithm, which simulates how a user understands the layout structure of a web page through visual perception, has recently been widely studied. Many web applications, such as information retrieval, information extraction, and automatic web page classification, can benefit from this algorithm. Automatic web page classification is one of the important application fields of web page segmentation, and Chinese web page classification is a representative instance of the problem that has been studied by numerous researchers. This paper focuses on the following aspects.
Firstly, I compare the traditional page representation method based on the DOM (Document Object Model) tree with the vision-based method, an automatic, top-down approach that detects web content structure independently of the tag tree and of the HTML document structure. The vision-based page representation method takes advantage of visual cues to obtain the vision-based content structure of a web page and thus bridges the gap between the DOM tree structure and the semantic structure. The web page is segmented along visual separators and organized as a semantic hierarchy, which is consistent with human perception to some extent.
Secondly, I propose a block-importance-based Chinese web page classification method built on VIPS. Because web pages contain a large variety of noisy information, I use vision-based page segmentation to partition a Chinese web page into several blocks and assign each block a weight (its importance within the web page). Only the blocks with high weights can discriminate between the different semantics within a web page, so only these blocks are used to classify Chinese web pages, which yields better results.
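The block-importance idea described above can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the `Block` type, the `select_important_text` helper, the importance values, and the 0.5 threshold are all hypothetical and exist only for illustration. The sketch assumes that segmentation has already produced blocks with importance weights, and it keeps only the text of high-weight blocks as input for a classifier.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A page block as a segmentation step might produce it (hypothetical type)."""
    text: str
    importance: float  # assumed to lie in [0, 1]; higher means more important


def select_important_text(blocks, threshold=0.5):
    """Join the text of blocks whose importance reaches the threshold.

    The threshold value is an illustrative assumption, not a value
    taken from the thesis.
    """
    kept = [b.text for b in blocks if b.importance >= threshold]
    return " ".join(kept)


# Toy page: navigation and advertising blocks get low weights.
blocks = [
    Block("navigation home login", 0.1),
    Block("main article about football league results", 0.9),
    Block("copyright advertisement links", 0.2),
]

page_text = select_important_text(blocks)
print(page_text)
```

In this toy example only the main article block survives the filter, so a downstream classifier would not see the navigation and advertising text. For Chinese pages, the retained text would still need word segmentation before feature extraction, which is outside the scope of this sketch.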
Recall and precision are two important measures in web page classification; in the experiments of this paper they are combined in the F1 measure.
The experiments compare the vision-based page classification method with a full-document-based page classification method, and the vision-based method performs better. Because it integrates the hierarchical structure and the semantic information of a page, it achieves better classification results. The experiments use the Support Vector Machine (SVM) and K-Nearest Neighbour (KNN) as classifiers.
Web applications such as information retrieval and Chinese web page classification can benefit from the vision-based page classification method because of its better classification results. As one application of vision-based page classification, image retrieval is introduced in this paper.
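Assuming the F measure referred to above is the standard F1 score (the harmonic mean of precision and recall), it could be computed as in the following sketch. The function names and the example counts are illustrative assumptions, not results from the thesis.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true-positive, false-positive
    and false-negative counts, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts for one category: 8 correct, 2 spurious, 2 missed.
p, r = precision_recall(tp=8, fp=2, fn=2)
f1 = f1_score(p, r)
```

A single combined score makes it easier to compare two classification methods per category than tracking precision and recall separately.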
Keywords/Search Tags: vision based page segmentation, block importance, SVM, web page classification
Related items