Scrapy Framework-based Web Crawler Achieved Data Capture And Analysis

Posted on: 2018-05-22  Degree: Master  Type: Thesis
Country: China  Candidate: Z J An  Full Text: PDF
GTID: 2348330515996640  Subject: Software engineering
Abstract/Summary:
With the development of the information age and the spread of programming technologies, search engines have become a necessity of daily life. Most search engines use crawler technology as a core module and return results for the user's keyword query. However, the explosive growth of information on the network makes it difficult to find and locate information. In response to these issues, this thesis works in a Python and Scrapy environment and takes Sina Weibo as its object of study. It studies and analyzes the principles, core modules, and running processes of current crawler technology, and on that basis builds an exploratory Web crawler on the Scrapy framework to complete goals such as data capture. First, it briefly presents the principles of crawler technology and introduces a number of key technologies used in developing a crawler project, including Cookies and the Robots protocol, both of which have a profound impact on this study. Second, it develops the crawler with Scrapy, an open-source crawler framework written in Python, and points out the significant role that NoSQL databases, represented by MongoDB, play in storing metadata. The Scrapy development process and the implementation details of the crawler are described. Third, key issues in the crawler's design are discussed and a custom spider solution is implemented: Cookie replacement and User-Agent spoofing are used to get around the site's access limits, and for the problem of handling URLs under multithreading, Scrapy's own solution is used and analyzed. Finally, the crawler is tested and the results are shown, together with the remaining problems and possible directions for improvement.
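As an illustration of the URL handling mentioned in the abstract, the following is a minimal standard-library sketch of fingerprint-based URL deduplication that is safe to call from several threads. It is not the thesis's code and not Scrapy's implementation; the names `canonicalize` and `SeenUrls`, the normalization rules, and the choice of SHA-1 are assumptions made for the example.

```python
import hashlib
import threading
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal.

    Assumed rules for this sketch: lowercase the host, sort the query
    parameters, drop the fragment, and treat an empty path as "/".
    """
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )


class SeenUrls:
    """Hypothetical helper: remembers fingerprints of URLs already scheduled."""

    def __init__(self) -> None:
        self._fingerprints = set()
        self._lock = threading.Lock()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was not seen before, and record it."""
        fp = hashlib.sha1(canonicalize(url).encode("utf-8")).hexdigest()
        # The lock makes the check-and-insert step atomic across threads.
        with self._lock:
            if fp in self._fingerprints:
                return False
            self._fingerprints.add(fp)
            return True
```

A crawler loop would call `add_if_new(url)` before queueing a request and skip the URL when it returns `False`. Storing fixed-length fingerprints rather than full URLs keeps the memory cost per entry constant.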
Keywords/Search Tags:Spider, Scrapy, URL deduplication, Python, Cookie