Design And Implementation Of Distributed Books Web Crawler System

Posted on: 2015-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: P C Zhao
Full Text: PDF
GTID: 2268330428476151
Subject: Computer application technology
Abstract/Summary:
As Internet technology develops rapidly, people's lifestyles have gradually changed. Previously, books could be read only in print, but the medium has changed and e-books have largely replaced printed books. The number of e-books is growing rapidly, however, so selecting useful books from the Internet has become an important problem.

This thesis designs and implements DScrapy, a distributed web crawler system for books based on the Scrapy framework. DScrapy can be used to download book detail information and book files from the Internet. The crawled data is stored in a sharded MongoDB deployment, and users can manipulate the book data conveniently with the commands MongoDB provides.

Firstly, the open-source Scrapy framework was studied in depth. Scrapy is not designed for distributed crawling and can be used only as a single crawler. Therefore, a new scheduler was designed for distributed crawling to replace Scrapy's built-in scheduler. A book pipeline was then designed to store book covers, book detail information, and book files.

Secondly, the distributed web crawler system DScrapy was implemented. Based on the design described above, the coding and testing work was done as follows: (1) choosing Linux as the development platform; (2) using XPath to extract information from web page source code; (3) using the Redis in-memory database to store the URLs to be crawled; (4) using sharded MongoDB to store book detail information; (5) using the GridFS file system to store book files.

Thirdly, DScrapy was tested on large websites. The results show that the system supports distributed crawling, can split a large task across multiple crawlers, and makes crawling more efficient.

Finally, the content of this thesis is summarized and directions for further research are presented.
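The idea of replacing Scrapy's built-in scheduler with one that keeps the URLs to be crawled in a shared store can be illustrated with a minimal Python sketch. This is not DScrapy's actual code, which the abstract does not show. The class names `InMemoryStore` and `SharedScheduler`, the key names, and the example URLs are all hypothetical, and `InMemoryStore` is only a local stand-in that imitates the Redis commands SADD, LPUSH, and RPOP so that the sketch runs without a server.

```python
from collections import deque
import hashlib


class InMemoryStore:
    """Hypothetical stand-in for a Redis server (only three commands)."""

    def __init__(self):
        self._sets = {}
        self._lists = {}

    def sadd(self, key, member):
        # Imitates SADD: returns 1 if the member was new, 0 otherwise.
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def lpush(self, key, value):
        # Imitates LPUSH: insert at the head of the list.
        self._lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        # Imitates RPOP: remove from the tail, or None if the list is empty.
        items = self._lists.get(key)
        return items.pop() if items else None


class SharedScheduler:
    """URL frontier shared by every crawler that uses the same store."""

    def __init__(self, store, queue_key="books:queue", seen_key="books:seen"):
        # Key names are assumptions made for this sketch.
        self.store = store
        self.queue_key = queue_key
        self.seen_key = seen_key

    def enqueue(self, url):
        # Deduplicate on a fingerprint so no URL is scheduled twice,
        # regardless of which crawler discovered it.
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if self.store.sadd(self.seen_key, fingerprint) == 0:
            return False
        self.store.lpush(self.queue_key, url)
        return True

    def next_url(self):
        # LPUSH plus RPOP gives first-in, first-out order.
        return self.store.rpop(self.queue_key)


# Two crawlers share one store, so they also share one frontier.
store = InMemoryStore()
crawler_a = SharedScheduler(store)
crawler_b = SharedScheduler(store)

crawler_a.enqueue("http://example.com/book/1")
crawler_a.enqueue("http://example.com/book/2")
duplicate_accepted = crawler_b.enqueue("http://example.com/book/1")

first = crawler_b.next_url()
second = crawler_a.next_url()
third = crawler_a.next_url()
```

Because both scheduler objects read and write the same queue and the same set of seen fingerprints, a URL found by one crawler can be fetched by another, and a URL already scheduled is rejected everywhere. In a real deployment the stand-in would be replaced by a client connected to a Redis server.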
Keywords/Search Tags: Distributed, Book Crawler, Scrapy, Data Storage