Data Capture And Analysis With A Web Crawler Based On The Scrapy Framework

Posted on:2018-05-22

Degree:Master

Type:Thesis

Country:China

Candidate:Z J An

Full Text:PDF

GTID:2348330515996640

Subject:Software engineering

Abstract/Summary:

With the development of the information age and the spread of programming technologies, search engines have become a necessity of daily life. Most search engines use crawler technology as a core module and return results for the user's keyword query. However, the explosive growth of information on the network makes it difficult to find and locate information. In response to these issues, this thesis works in a Python and Scrapy environment and takes Sina Weibo as its object of study. Building on a study and analysis of the principles, core modules, and running processes of current crawler technology, it carries out an exploratory implementation of a web crawler on the Scrapy framework and achieves goals such as data capture. First, it briefly introduces the principles of crawler technology and a number of key technologies used in developing a crawler project, along with cookies and the robots protocol, both of which strongly influence this study. Second, it uses Scrapy, an open-source crawler framework written in Python, for crawler development, points out the significant role that NoSQL databases, represented by MongoDB, play in storing metadata, and details the Scrapy development process and the implementation of the crawler. Third, it discusses the key issues in crawler design and implements a custom spider solution: cookie replacement and User-Agent spoofing are used to get around site restrictions, and for the URL problem under multithreading, Scrapy's own solution is used and analyzed. Finally, the crawler is tested and the results are presented, along with the remaining problems and possible directions for improvement.
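The abstract names two techniques without showing them: URL deduplication when several threads are crawling, and User-Agent spoofing. The sketch below illustrates both ideas using only the Python standard library. It is not the thesis's code and not Scrapy's implementation; Scrapy's built-in duplicate filter works on request fingerprints, and this is a simplified stand-in for that idea. The names `USER_AGENTS`, `canonicalize`, `fingerprint`, `SeenUrls`, and `pick_headers` are hypothetical, and the User-Agent strings are placeholders.

```python
import hashlib
import random
import threading
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical pool of User-Agent strings (placeholders, not from the thesis).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


def canonicalize(url):
    """Normalize a URL so that trivially different spellings compare equal."""
    parts = urlsplit(url)
    # Sort query parameters and drop the fragment, which is not sent to the server.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )


def fingerprint(url):
    """Return a fixed-size SHA-1 hex digest of the canonical URL."""
    return hashlib.sha1(canonicalize(url).encode("utf-8")).hexdigest()


class SeenUrls:
    """Set of URL fingerprints that is safe to share between threads."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_new(self, url):
        """Record the URL; return True only the first time it is seen."""
        fp = fingerprint(url)
        # The lock makes the check-and-insert a single atomic step.
        with self._lock:
            if fp in self._seen:
                return False
            self._seen.add(fp)
            return True


def pick_headers(rng=random):
    """Choose a User-Agent at random for the next request."""
    return {"User-Agent": rng.choice(USER_AGENTS)}
```

A crawler loop would call `add_if_new(url)` before scheduling a request and skip the URL when it returns `False`, and would attach `pick_headers()` to each outgoing request. Storing fingerprints rather than full URLs keeps the memory used per entry constant.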

Keywords/Search Tags:

Spider, Scrapy, URL deduplication, Python, Cookie

Related items

1	The Design And Implement Of Search Engine System On The Campus Networks Use Python-based Technology
2	Research And Implementation On WEB Data Mining Technology Based On Python
3	Research Of Web Application Security Vulnerabilities Mining
4	The Application And Research Of Chinese Word Segmentation And Web Deduplication In News Vertical Search Engine
5	The Implement And Improvement Of SYN Cookie On The Basis Of IPv6 Protocol
6	Design And Research Of Network Spider
7	Design And Development Of Distributed Crawler Based On Scrapy Framework
8	Production And Application Of Digital Label Based On Cookie
9	Scrapy-based Crawling And Characteristics Analysis Of An E-commerce Network
10	Evaluation Of Cookie Sameorigin Policy For Web Application Security