| With the rapid development of Internet technology and the improvement of living standards, online e-commerce is booming, which has contributed to the rise of various online shopping platforms. In recent years, the growth of the B2C shopping model, represented by Jingdong and Tmall, has created considerable difficulties for enterprise development and user choice because of the increasing workload and volume of information. How can search engine technology be used to collect massive data, and how can users' real demands be identified from that data? These questions have recently become hot and difficult topics in the field of e-commerce. Therefore, using anti-crawler technology and a data-driven approach to mine user preferences precisely is an important guarantee for B2C shopping platforms to carry out precision marketing. However, with the growth of information on the Internet and continuing innovation in anti-crawler technology, traditional crawler technology struggles to meet the demands of massive data collection. Its limitations are as follows: first, traditional crawler technology cannot complete the task of collecting massive commodity data; second, traditional crawler technology lacks “inspiration” and therefore has difficulty getting past crawler blocks such as human-machine interaction and fingerprint identification; third, traditional crawler technology is slow and has long run times for data collection. These problems seriously affect research in data mining (DM). Thus, distributed crawler technology based on the Ant Colony Algorithm has been created and put into use, and it is regarded as a potential solution to the lack of “inspiration” in traditional crawler technology. This paper studies data collection and web anti-crawler techniques on mainstream e-commerce sites. To begin with, this paper introduces in detail the basic theory of search engines, the rationale of web crawlers, the theory of the Ant Colony Algorithm, distributed crawler technology, anti-crawler technology and code recognition technology, and on that basis introduces the distributed crawler model. In the
second part, it studies the Scrapy-Redis model of the distributed crawler. After that, this paper focuses on the Ant Colony Algorithm model and code recognition theory, and puts forward a distributed crawler based on the Ant Colony Algorithm. In addition, it analyzes the related theories in depth and identifies the anti-crawler callback addresses and characteristics of e-commerce platforms from the running log, which can be used to guide the traditional crawler. Finally, based on the Ant Colony Algorithm distributed crawler collection system, this paper mines e-commerce data using Python. By comparing the distributed crawler based on the Ant Colony Algorithm with the traditional crawler, this paper finds that the traditional crawler has no knowledge of the overall distribution of information resources on e-commerce websites, and thus fails to predict the crawling direction or handle crawler traps. The research data also show that distributed crawler technology based on the Ant Colony Algorithm can guide the traditional crawler better. |