
The Design And Implementation Of Web Crawler With Page Refresh Mechanism

Posted on: 2009-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: S Yin
Full Text: PDF
GTID: 2178360242480504
Subject: Computer application technology

Abstract/Summary:
With the rapid development of the Internet, the amount of information on the Web has grown very quickly: from a few thousand pages in 1993 to at least about eight billion pages and 56 billion links by 2003, and the numbers of pages and links today are far beyond those figures. Moreover, Web documents are distributed, heterogeneous, and unstructured or semi-structured. People rely on the Web more and more to find the information they need, but there is so much of it that locating the right information effectively has become a pivotal problem. Search engines exist to solve this problem, and the Web crawler is the foundation of a search engine. After many years of development the technology is already in broad use, but because of competition among search engine companies the detailed designs are not public, and the descriptions of crawlers in the literature are too brief to re-implement. Collecting new and changed pages is one of the core tasks of a search engine, which requires the crawler to support page refresh and incremental collection. How to refresh pages, which policy to follow, and how to make refresh work together with the crawler are the focus of this thesis.

First, this thesis introduces the background and early development of Web crawlers, and then explains their structure and workflow using Mercator as an example. The workflow can be summarized as follows: the crawler starts with one or more seed URLs, downloads the corresponding pages, extracts new URLs from them, and adds those URLs to the work queue; it continues in this way until a stop condition is met. The thesis covers the main crawler techniques, including multi-node cooperation, URL selection, the URL frontier, duplicate-URL elimination, DNS resolution, page collection, and related work. It then summarizes the desired characteristics of a crawler: distributed, scalable, high-performance, polite, continuous, extensible, and portable. The key problems of this task are how to select the pages to download, how to refresh pages, how to meet a weak politeness guarantee, and how to parallelize crawlers.

Second, the thesis introduces the page refresh mechanism and analyzes several refresh techniques and algorithms: 1) the adaptive ("contiguity") policy, in which the system sets an initial refresh interval based on the attributes of a newly crawled page, halves the interval if the page has changed when it is revisited, and doubles it otherwise; 2) the fixed-interval policy, which refreshes pages at a constant period; and 3) the backtrack-in-time policy, in which a page that keeps not changing follows the adaptive rule, but once it changes its refresh interval is reset to a small value. The thesis examines the characteristics, advantages, and disadvantages of each; they form the basis of the refresh algorithm designed later (an illustrative sketch of the adaptive policy is given below).
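To make the adaptive policy concrete, the following is a minimal illustrative sketch in C# (the implementation language named at the end of this abstract). The class name, the interval bounds, and the initial interval are assumptions for illustration, not the thesis's actual code.

```csharp
using System;

// Sketch of the adaptive ("contiguity") refresh policy described above.
// Bounds and initial interval are assumed values, not taken from the thesis.
class RefreshPolicy
{
    static readonly TimeSpan MinInterval = TimeSpan.FromMinutes(30); // assumed lower bound
    static readonly TimeSpan MaxInterval = TimeSpan.FromDays(30);    // assumed upper bound

    public TimeSpan Interval { get; private set; } = TimeSpan.FromDays(1); // assumed initial value

    // Contiguity rule: halve the interval when the page changed since the last
    // visit, double it otherwise, keeping the result within the bounds.
    public void OnRevisit(bool pageChanged)
    {
        Interval = pageChanged
            ? TimeSpan.FromTicks(Math.Max(Interval.Ticks / 2, MinInterval.Ticks))
            : TimeSpan.FromTicks(Math.Min(Interval.Ticks * 2, MaxInterval.Ticks));
    }

    // "Backtrack in time" variant: after a long quiet period, a detected change
    // resets the interval to a small value instead of merely halving it.
    public void OnRevisitWithBacktrack(bool pageChanged)
    {
        if (pageChanged)
            Interval = MinInterval;
        else
            OnRevisit(false);
    }
}
```

Halving on change and doubling on no change lets the revisit frequency converge toward each page's own rate of change without keeping a long history of past visits.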
Third, the thesis presents the design and analysis of the crawler system. The system has two main functions, crawling and page refresh, and the crawler program consists of the following modules:
1) Download Module: downloads the pages for given URLs; it is multi-threaded and built on a thread pool.
2) Manage Module: handles initialization, start, and termination of the system and coordinates all the other modules.
3) Store Module: a middle layer between the system and the physical database that writes downloaded data to disk; it also contains a cache to prevent frequent I/O operations.
4) Scheduler Module: acts as the URL frontier, but with additional functions, scheduling both page refreshes and new downloads.
5) GUI Module: gives the program a graphical interface and makes debugging easier.
6) Assistant Module: contains frequently used algorithms and enumerations, for example URL extraction, URL-to-MD5 encoding, and MD5 digest comparison.
The thesis also gives the database design, which consists of three tables: 1) the Index Table, a hash table mapping each URL to its MD5 digest to speed up duplicate-URL lookups; 2) the URL Table, which stores information about each URL; and 3) the Page Table, which stores the downloaded pages.
The thesis then explains how the questions raised during the design are solved: 1) multi-threading is implemented with a thread pool; 2) URLs are encoded with MD5 and the digests are compared for duplicate-URL elimination, with a two-level cache added for speed; 3) the scheduler keeps a separate queue per hostname and submits only one job per hostname at a time, which provides the weak politeness guarantee; 4) regular expressions are used for URL extraction, which is effective and fast.
Regarding the page refresh mechanism, the thesis discusses the following aspects. 1) How to detect that a page has changed: the page content is encoded with MD5, because the digest is highly sensitive to any change in the page and is fast to compute with good performance. 2) Which refresh algorithm to use: the thesis proposes an improved adaptive algorithm that does not change the refresh interval immediately when a page does or does not change, but instead adjusts the interval according to the page's change history; this algorithm is fast and stable. (A sketch of the MD5-based fingerprinting is given after this summary.)
Finally, a crawler system with the page refresh mechanism was implemented in C# with a Microsoft SQL Server database.
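As an illustration of the MD5-based duplicate-URL elimination and page-change detection described above, here is a minimal sketch. The class and method names, and the in-memory collections standing in for the Index Table and the stored page digests, are assumptions for illustration, not the thesis's code.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Sketch of MD5 fingerprinting for duplicate-URL elimination and page-change
// detection. The in-memory collections stand in for the Index Table and the
// stored page digests; they are illustrative assumptions.
static class Fingerprint
{
    public static string Md5Hex(string text)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToString(hash).Replace("-", ""); // 32-character hex digest
        }
    }
}

class UrlSeenTest
{
    private readonly HashSet<string> seenDigests = new HashSet<string>();

    // Returns true the first time a URL is offered, false for duplicates.
    public bool TryAdd(string url) => seenDigests.Add(Fingerprint.Md5Hex(url));
}

class PageChangeDetector
{
    private readonly Dictionary<string, string> lastDigest = new Dictionary<string, string>();

    // Compares the digest of freshly downloaded content with the stored one;
    // any change to the page text produces a different digest.
    public bool HasChanged(string url, string content)
    {
        string digest = Fingerprint.Md5Hex(content);
        bool changed = !lastDigest.TryGetValue(url, out string previous) || previous != digest;
        lastDigest[url] = digest;
        return changed;
    }
}
```

Comparing fixed-length digests instead of full URLs or page bodies keeps the index small and lookups cheap, at the cost of treating any byte-level difference in a page as a change.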
Keywords/Search Tags: Implementation