| With the rapid development of Internet technology, online information has grown explosively, and the Web has become a huge distributed information space containing several billion web pages. Finding the desired information in this mass of data is very difficult, because most of it is disorganized and exists in semi-structured or unstructured form, so web pages need to be categorized to make useful information easier to find. Because manual categorization cannot cope with the volume of online documents, it is important to study automatic web page classification methods. This thesis studies approaches to the automatic classification of Chinese news web pages, as follows. First, an approach for automatic information extraction from Chinese news web pages is given. Non-standard tags in the web pages are repaired, and the pages are converted into DOM trees. To extract useful information, every line of the linear text obtained by traversing the DOM tree is labeled using a Conditional Random Field (CRF) model, and the boundaries of each kind of information in the page are determined. Second, because news web pages contain more structural information than general text, including title, metadata, content, and related links, the influence of this structural information on classification is studied, and a weighted method for combining these fields is proposed, which improves classification performance. Third, the performance of several feature selection methods in the classification of Chinese news web pages is compared and analyzed. The experiments show that the Information Gain method achieves better performance than the others, and that the Latent Semantic Indexing (LSI) method can dramatically decrease the feature dimension needed by classification models without reducing classification performance. |
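
To make two of the ideas above concrete, the following is a minimal Python sketch, not the thesis's implementation. It assumes pages are already segmented into tokens and grouped by field. The field names, the weight values in `FIELD_WEIGHTS`, and the function names are all hypothetical, chosen only for illustration; the abstract does not give the weights actually used. The first function combines per-field term counts with field weights, and the second computes the standard Information Gain of a term with respect to the class labels.

```python
import math
from collections import Counter

# Hypothetical field weights (assumption): the abstract does not state the values used.
FIELD_WEIGHTS = {"title": 3.0, "metadata": 2.0, "content": 1.0, "related_links": 0.5}


def weighted_term_counts(page, weights=FIELD_WEIGHTS):
    """Combine per-field token counts into one weighted term-frequency map.

    `page` maps a field name to a list of already-segmented tokens.
    Fields without a weight are ignored (weight 0).
    """
    combined = Counter()
    for field, tokens in page.items():
        w = weights.get(field, 0.0)
        for token, count in Counter(tokens).items():
            combined[token] += w * count
    return dict(combined)


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) * H(C | t present) - P(not t) * H(C | t absent).

    `docs` is a list of token sets, `labels` the matching class labels.
    """
    present = [c for d, c in zip(docs, labels) if term in d]
    absent = [c for d, c in zip(docs, labels) if term not in d]
    n = len(labels)
    return (
        entropy(labels)
        - len(present) / n * entropy(present)
        - len(absent) / n * entropy(absent)
    )
```

In a pipeline along these lines, terms would be ranked by `information_gain` and only the top-scoring ones kept as features, while `weighted_term_counts` would supply the feature values for each page.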