Research on Statistics-Based Classification of Chinese News Web Pages

Posted on: 2008-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Pang
Full Text: PDF
GTID: 2178360242469989
Subject: Basic mathematics
Abstract/Summary:
With the rapid development of Internet technology, online information has grown explosively, and the Web has become a huge distributed information space containing several billion web pages. Finding the desired information in this mass of data is very difficult, because most of it is in semi-structured or unstructured form and is poorly organized. Categorizing web pages is therefore necessary to make searching for useful information easier. Since manual categorization cannot cope with the volume of online documents, it is important to study automatic web page classification methods.

This thesis studies approaches to the automatic classification of Chinese news web pages, as follows.

An approach to automatic information extraction from Chinese news web pages is given. First, non-standard tags in the web pages are repaired and the pages are converted into DOM trees. To extract the useful information, the DOM trees are traversed to obtain linear text, every line of that text is labeled using a Conditional Random Field model, and the boundaries of each kind of information in the page are determined.

Compared with general text, news web pages contain more structural information, including the title, metadata, content, and related links. The influence of this structural information on classification is studied, and a weighted method for combining it is proposed, which improves classification performance.

The performance of several feature selection methods in the classification of Chinese news web pages is compared and analyzed. The experiments show that the Information Gain method performs better than the others, and that the LSI method can dramatically decrease the feature dimension needed by the classification models without reducing classification performance.
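As an illustration of the Information Gain criterion mentioned above, the following is a minimal sketch using the standard textbook definition, IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], where each term is treated as a binary present/absent feature. It is not taken from the thesis: the function names, the toy documents, and the class labels are all assumptions made for the example, and the thesis may use a different formulation or preprocessing (for instance, Chinese word segmentation is skipped here).

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Information gain of a binary term-presence feature.

    docs   -- list of token sets, one per document (hypothetical toy format)
    labels -- list of class labels, aligned with docs
    term   -- the candidate feature
    """
    n = len(docs)
    if n == 0:
        return 0.0

    # Class counts among documents that contain / do not contain the term.
    with_term = Counter()
    without_term = Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:
            with_term[label] += 1
        else:
            without_term[label] += 1

    n_with = sum(with_term.values())
    n_without = n - n_with

    # H(C): entropy of the class distribution before seeing the term.
    h_class = entropy(list(Counter(labels).values()))
    # H(C | term): entropy after splitting on term presence.
    h_cond = (n_with / n) * entropy(list(with_term.values())) \
        + (n_without / n) * entropy(list(without_term.values()))
    return h_class - h_cond


# Toy data (assumed, not from the thesis).
docs = [
    {"match", "goal"},
    {"goal", "team"},
    {"stock", "market"},
    {"market", "team"},
]
labels = ["sports", "sports", "finance", "finance"]

# Rank candidate terms by information gain, highest first.
vocab = sorted(set().union(*docs))
ranking = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In this toy data, a term that appears in exactly the documents of one class (such as "goal") gets the maximum score of 1 bit, while a term spread evenly across both classes (such as "team") scores 0. Feature selection would then keep the top-ranked terms.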
Keywords/Search Tags: Web page classification, Web information extraction, Web structure information combination, Feature selection