Design And Implementation Of Chinese Webpage Automatic Collection And Classification

Posted on:2011-01-16

Degree:Master

Type:Thesis

Country:China

Candidate:H B Yu

Full Text:PDF

GTID:2178360308961188

Subject:Software engineering

Abstract/Summary:

With the rapid development of science and technology, we have entered the digital information age. The Internet, regarded as the world's largest information database, has become the main tool for obtaining information. Because information resources on the network are massive, dynamic, heterogeneous, and semi-structured, and lack unified organization and management, finding the information users need quickly and accurately in this mass of resources is a problem that urgently needs solving. Web information collection and classification has therefore become a research hotspot.

The goal of traditional Web information collection is to gather as much information as possible, or even all the resources on the Web. The order in which pages are collected and the topics they cover are not considered during collection. As a result, the collected page content is cluttered, and much of it is rarely used, which wastes system and network resources. An effective collection method is therefore needed to reduce clutter and duplication among the collected pages. Automatically classifying web pages helps build effective and efficient search engines. Web page classification is an effective means of organizing and managing information: it can largely resolve information clutter and help users accurately locate the information they need. Traditionally, however, classification has been done manually, and with the rapid growth of all kinds of information on the Internet, relying on manual handling alone is unrealistic. Automatic web page classification is therefore not only a method of great practical value but also an effective means of organizing and managing data, and it is an important part of the research in this thesis.

First, the background, purpose, and research status of the topic are introduced, and the theories, techniques, and algorithms of web page collection and classification are described, including web crawler technology, duplicate web page removal, information extraction, Chinese word segmentation, feature extraction, and web page classification. After a comprehensive comparison of several typical algorithms, the topical crawler and KNN classification are selected for their outstanding performance. By combining these technologies and analyzing the structure and characteristics of Chinese web pages, a system for collecting and classifying Chinese web pages is designed and implemented. Finally, the system is implemented in code. Test results show that the system meets the design requirements, and it has been applied in many fields.
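
As a rough illustration of the KNN classification step named in the abstract, the following Python sketch classifies documents that have already been split into words. The abstract does not state the feature weighting, similarity measure, voting scheme, or value of k used in the thesis, so TF-IDF weighting, cosine similarity, similarity-weighted voting, k=3, and all function names below are assumptions made for illustration only. The input is assumed to come from a separate Chinese word segmentation step.

```python
import math
from collections import Counter


def vectorize(tokens, idf):
    """Turn a token list into a sparse TF-IDF vector (dict: term -> weight).

    Terms that never appeared in the training data are ignored.
    """
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def tf_idf_vectors(docs):
    """Compute smoothed IDF from training docs and vectorize each of them.

    docs: list of token lists (assumed to be pre-segmented text).
    Returns (list of sparse vectors, idf dict).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: count each term once per doc
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [vectorize(tokens, idf) for tokens in docs], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


def knn_classify(query_tokens, train_vectors, train_labels, idf, k=3):
    """Label a query by a similarity-weighted vote of its k nearest neighbours."""
    q = vectorize(query_tokens, idf)
    scored = sorted(
        ((cosine(q, v), label) for v, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    # Tiny hypothetical training set, already segmented into words.
    docs = [
        ["足球", "比赛", "球员"],
        ["篮球", "比赛", "联赛"],
        ["股票", "市场", "投资"],
        ["银行", "利率", "投资"],
    ]
    labels = ["sports", "sports", "finance", "finance"]
    vectors, idf = tf_idf_vectors(docs)
    print(knn_classify(["足球", "联赛", "比赛"], vectors, labels, idf, k=3))
```

Weighting each neighbour's vote by its similarity is one common way to reduce the influence of distant neighbours when k is larger than the number of truly similar documents; a plain majority vote would be an equally valid reading of "KNN classification".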

Keywords/Search Tags:

web information collection, webpage classification, information extraction, segmentation, feature extraction

Related items

1	Page Events Information Extraction
2	Design And Implementation Of Education News Webpage Information Extraction System
3	Research On Information Extraction And Full Text Retrieval Of Crop Diseases Articles
4	Design And Implementation Of Content-based Webpage Collection And Classification System
5	User Web Information Collection And Analysis System Based On The Smart Router
6	For Internet Access To Multiple Information Technology Research
7	The Research And Design Of Network Information Monitoring And Analysis System
8	The Personal Information Extraction Based On Webpage Understanding
9	Research And Implement Of Web Information Intelligence Collection And Classification
10	The Design And Implementation Of Chinese Webpage Classification And Storage System