Research On The Technology Of Incremental Web Pages Crawler

Posted on: 2008-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: C Gong
GTID: 2178360245997731
Subject: Computer Science and Technology
Abstract/Summary:
Incremental crawling is an important research topic in information retrieval. An incremental crawler aims to gather changed pages, new pages, and dead pages, of which new pages are the most important. This approach shortens the gathering period and keeps the collected pages up to date, so it is widely used in large-scale search engines and vertical search engines. This thesis studies in depth the three stages of incremental crawling for new pages: incrementally crawling web page trees, gathering content page groups, and pruning web page trees.

Because page types are not consistent across the depths of a website, the crawler cannot gather pages simply by their depth in the site. For the first stage, this thesis proposes a gathering method based on web page trees. The crawler first recognizes the index pages of a website and obtains the web page trees rooted at those index pages. In this way, the crawler divides pages into several sets according to their update periods, and each depth of a tree contains only one type of page, so the pages can be gathered easily.

For the second stage, a crawler can reach only one page of a content page group from the index page; the other pages in the same group are missed, so gathering recall is low. This thesis introduces the definition of content page groups, the link relations of content page groups, and a gathering method for them.

For the third stage, not every branch of a web page tree leads to new pages, so the crawler achieves high gathering precision by cutting old branches using visited-URL storage and recognition of dates in URLs.

With the techniques above, gathering precision and recall for new pages improve substantially: average precision reaches 92.12% and average recall reaches 93.38%.
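
The pruning stage mentioned in the abstract (dropping old branches by combining a visited-URL store with date recognition in URLs) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `url_date`, `VisitedUrls`, and `prune`, the date pattern, and the example URLs are all hypothetical, and the store uses plain LRU eviction because the abstract does not describe how the "improved LRU" named in the keywords differs.

```python
import re
from collections import OrderedDict
from datetime import date

# Assumed pattern: dates such as /2008/07/23/, 2008-07-23 or 20080723 in a URL.
# Restricting years to 20xx is an illustrative choice, not taken from the thesis.
DATE_IN_URL = re.compile(
    r"(?<!\d)(20\d{2})[/_-]?(0[1-9]|1[0-2])[/_-]?(0[1-9]|[12]\d|3[01])(?!\d)"
)


def url_date(url):
    """Return the first date embedded in the URL, or None if none is found."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 2008-02-31 matches the pattern but is not a real date
        return None


class VisitedUrls:
    """Bounded visited-URL store with plain LRU eviction (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._urls = OrderedDict()

    def seen(self, url):
        """Return True if the URL was visited before; record it either way."""
        if url in self._urls:
            self._urls.move_to_end(url)  # mark as most recently used
            return True
        self._urls[url] = None
        if len(self._urls) > self.capacity:
            self._urls.popitem(last=False)  # evict the least recently used URL
        return False


def prune(links, visited, last_crawl):
    """Keep only links that may lead to new pages."""
    fresh = []
    for url in links:
        found = url_date(url)
        if found is not None and found < last_crawl:
            continue  # old branch: the URL carries a date before the last crawl
        if visited.seen(url):
            continue  # old branch: the URL has already been gathered
        fresh.append(url)
    return fresh


if __name__ == "__main__":
    store = VisitedUrls(capacity=1000)
    links = [
        "http://example.com/news/2008/07/23/a.html",
        "http://example.com/news/2008/07/01/b.html",
        "http://example.com/about.html",
    ]
    print(prune(links, store, date(2008, 7, 20)))
```

In this sketch a URL without a recognizable date is never pruned by the date rule; it is filtered only by the visited-URL store, so undated pages are still gathered once.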
Keywords/Search Tags: incremental crawler, web page tree, content page groups, improved LRU, search engine