| As the amount of information on the Web grows, a search engine query returns a very long list of documents, and it is difficult to find the desired information quickly among them. One way to address this problem is to classify web documents according to various criteria. Most web classification work has focused on the subject or topic of a document. However, users sometimes want documents of a particular genre, so web genre is another criterion for classifying web documents. This technology is not yet mature; Chinese web genre classification in particular is at an early stage. This paper presents an account of web genre classification based on in-depth research. Web genre classification is a taxonomy that incorporates the style, form and content of a document; it allows multi-genre classification, in which multiple genres can be assigned to a single document. The major contribution of this paper is an automatic system for Chinese web genre classification. The first step is to choose the web genres and build the corresponding web corpus from CWT200g, the evaluation platform of SEWM2006. The second step is to construct feature sets containing features extracted from the URL and from the style, form and content information of web documents. In this step, a parametric distribution method is used to evaluate the features in order to remove irrelevant ones. The system uses an SVM to classify the corpus. Two feature sets are designed so that their classification results can be compared. The conclusion is that classification based on surface features achieves precision comparable to that based on deeper structural properties. The experiments achieve good results, which shows that genre classification of Chinese web pages is feasible and has theoretical value. |
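The feature extraction step can be illustrated with a minimal sketch, using only the Python standard library. It is not the system described in the abstract: the function name `extract_features`, the helper class `_PageStats`, and the specific features (URL depth, link count, text length, and so on) are hypothetical choices, made only to show how URL, form and content information might be turned into a feature vector before being passed to a classifier such as an SVM.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _PageStats(HTMLParser):
    """Collects simple form and content counts from an HTML page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}   # start-tag name -> number of occurrences
        self.text_chars = 0    # visible text length, excluding script/style bodies
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def extract_features(url, html):
    """Return a dict of hypothetical surface features for one web page."""
    parsed = urlparse(url)
    path_parts = [p for p in parsed.path.split("/") if p]

    stats = _PageStats()
    stats.feed(html)
    stats.close()
    total_tags = sum(stats.tag_counts.values())

    return {
        # URL features
        "url_depth": len(path_parts),
        "url_has_query": int(bool(parsed.query)),
        "url_digit_count": sum(ch.isdigit() for ch in url),
        # Form (layout) features
        "link_count": stats.tag_counts.get("a", 0),
        "form_count": stats.tag_counts.get("form", 0),
        "image_count": stats.tag_counts.get("img", 0),
        # Content features
        "text_chars": stats.text_chars,
        "text_per_tag": stats.text_chars / total_tags if total_tags else 0.0,
    }


if __name__ == "__main__":
    # Made-up page used purely for demonstration.
    page = (
        "<html><body><h1>标题</h1><p>正文内容</p>"
        '<a href="/a">更多</a><form><input></form>'
        "<script>var x = 1;</script></body></html>"
    )
    print(extract_features("http://example.cn/news/2006/item.html?id=42", page))
```

Each page yields one dictionary of numeric values. In a full pipeline these would be filtered for relevance and then passed to the classifier; both of those steps are outside the scope of this sketch.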