A Research On Statistic-based Classification Of Chinese News Web Page

Posted on:2008-10-08

Degree:Master

Type:Thesis

Country:China

Candidate:Y L Pang

Full Text:PDF

GTID:2178360242469989

Subject:Basic mathematics

Abstract/Summary:

With the rapid development of Internet technology, online information has grown explosively, and the Web has become a huge distributed information space containing several billion web pages. In this mass of data it is very difficult to find the information one wants, because most of it is semi-structured or unstructured and is disorganized. Categorizing web pages is therefore necessary to make useful information easier to search. Since manual categorization cannot cope with the volume of online documents, it is important to study automatic web page classification methods. This thesis studies approaches for the automatic classification of Chinese news web pages, as follows.

An approach for automatic information extraction from Chinese news web pages is given. First, the non-standard tags in the web pages are repaired and the pages are converted into DOM trees. To extract the useful information, the DOM trees are traversed to obtain linear text, every line of that text is labeled using a Conditional Random Field model, and the boundaries of the different kinds of information in the page are determined.

News web pages contain more structural information than general text, including the title, metadata, content, and related links. The influence of this structural information on classification is studied, and a weighting method for combining it is proposed, which improves classification performance.

The performance of several feature selection methods in the classification of Chinese news web pages is compared and analyzed. The experiments show that the Information Gain method performs better than the others, and that the LSI method can dramatically reduce the feature dimension needed by classification models without reducing classification performance.
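
The structure weighting and Information Gain feature selection described in the abstract could look roughly like the sketch below. It is an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the weighting formula, the weights, or the exact Information Gain variant. The function names, the field weights, and the toy documents are all hypothetical. Pages are assumed to be already segmented into word tokens, and Information Gain is computed for the binary feature "the term occurs in the document".

```python
import math
from collections import Counter


def weighted_term_counts(page, field_weights):
    """Combine per-field token lists into one weighted term-count vector.

    `page` maps a field name (e.g. "title", "content") to a list of tokens.
    `field_weights` maps a field name to a weight; these values are
    hypothetical, the abstract does not state the ones the thesis used.
    """
    counts = Counter()
    for field, tokens in page.items():
        weight = field_weights.get(field, 0.0)  # unknown fields contribute nothing
        for token in tokens:
            counts[token] += weight
    return counts


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'.

    `docs` is a list of token sets, `labels` the matching class labels.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_top_terms(docs, labels, k):
    """Return the k terms with the highest information gain (ties by term)."""
    vocabulary = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


# Toy, made-up data: four pre-segmented documents in two classes.
docs = [
    {"match", "goal"},
    {"goal", "team"},
    {"stock", "market"},
    {"market", "bank"},
]
labels = ["sports", "sports", "finance", "finance"]

top_terms = select_top_terms(docs, labels, 2)

# Hypothetical weights that count a title token three times as much as a body token.
page = {"title": ["goal"], "content": ["goal", "team"]}
vector = weighted_term_counts(page, {"title": 3.0, "content": 1.0})
```

In this toy set, "goal" and "market" each separate the two classes perfectly, so they receive the maximum gain of one bit and are the two terms selected. A real experiment would compute the scores over a segmented training corpus and feed the retained terms to the classifier.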

Keywords/Search Tags:

Web page classification, Web information extraction, Web structure information combination, feature selection

Related items

1	Research And Implementation Of A Web Information Extraction System Based On Semantic Structure Of The Website
2	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
3	Visual Web Page Information Extraction And Text Feature Word Extraction Technology Research
4	Research On Web Article Automatic Extraction Method Based On Page Segmentation
5	Research And Realization Of Term Selection In Chinese Web Page Classification Based On VSM
6	Research On Internet Public Opinion Information Extraction And Classification
7	Research On Specialty Knowledge Retrieval Method Based On Web Information Extraction
8	A Study On Feature Design Algorithms With Application To Image Annotation And Information Extraction
9	The Design And Implement Of Web Page Automatic Categorization And Storage Management System
10	Research On Mining Structure Of WEB Page For Information Extraction