Design And Implementation Of Chinese Webpage Automatic Collection And Classification

Posted on: 2011-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: H B Yu
Full Text: PDF
GTID: 2178360308961188
Subject: Software engineering
Abstract/Summary:
With the rapid development of science and technology, we have entered the digital information age. The Internet, regarded as the world's largest information database, has become the main tool for obtaining information. Because online information resources are massive, dynamic, heterogeneous, and semi-structured, and lack unified organization and management, finding the information users need quickly and accurately within this mass of resources is a problem that urgently needs solving. Web information collection and classification has therefore become a research hotspot.

The goal of traditional Web information collection is to gather as much information as possible, or even all the resources on the Web. Neither the order in which pages are collected nor their topics are considered during collection. As a result, the collected page content is cluttered and a large part of it is rarely used, so system and network resources are wasted. An effective collection method is therefore needed to reduce clutter and duplication among the collected pages. The collected pages are then automatically classified to support an effective and efficient search engine. Web page classification is an effective means of organizing and managing information: it can largely resolve information clutter and help users accurately locate the information they need. However, classification has traditionally been done manually. With the rapid growth of all kinds of information on the Internet, handling it by manual means alone is unrealistic. Automatic Web page classification is therefore not only a method of great practical value but also an effective means of organizing and managing data, and it is an important research topic of this thesis.

First, the background, purpose, and research status of the topic are introduced, and the theories, techniques, and algorithms of web page collection and classification are described, including web crawler technology, duplicate web page removal, information extraction, Chinese word segmentation, feature extraction, and web page classification. After a comprehensive comparison of several typical algorithms, a topical crawler and KNN classification are selected for their strong performance. A system for collecting and classifying Chinese web pages is then designed and implemented by combining these technologies with an analysis of the structure and characteristics of Chinese web pages. Finally, the system is coded in a programming language. Test results show that the system meets the design requirements, and it has been applied in many fields.
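The abstract names KNN classification but does not give its details, so the following is only a minimal sketch of the general technique, not the thesis's implementation. It assumes pages have already been word-segmented into token lists, uses raw term frequencies with cosine similarity, and weights each neighbor's vote by its similarity. The names `cosine`, `knn_classify`, and `TRAINING`, as well as the toy data and labels, are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(tokens, training, k=3):
    """Label a segmented page by a similarity-weighted vote of its k nearest neighbors.

    `training` is a non-empty list of (token_list, label) pairs.
    """
    query = Counter(tokens)
    # Score every training document against the query, most similar first.
    scored = sorted(
        ((cosine(query, Counter(doc)), label) for doc, label in training),
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Toy, already-segmented training pages (hypothetical data).
TRAINING = [
    (["股票", "市场", "上涨"], "finance"),
    (["银行", "利率", "市场"], "finance"),
    (["球队", "比赛", "进球"], "sports"),
    (["比赛", "冠军", "球员"], "sports"),
]

if __name__ == "__main__":
    print(knn_classify(["市场", "利率", "下跌"], TRAINING))
```

In a fuller pipeline, the token lists would come from a Chinese word segmenter and the raw counts would usually be replaced by weighted features selected in a feature extraction step; both are left out here to keep the sketch self-contained.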
Keywords/Search Tags: web information collection, webpage classification, information extraction, word segmentation, feature extraction