
Design And Implementation Of Distributed Web Crawler System Supporting Dynamic Web Pages Parsing

Posted on: 2018-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Ou
Full Text: PDF
GTID: 2428330545961122
Subject: Software engineering
Abstract/Summary:
Web crawlers provide the most convenient way to obtain data from the internet. They are not only the foundation of traditional search engine companies but also a useful tool for acquiring web data in the big data era. In recent years, with the rapid development of internet technology and the exponential growth of web data, stand-alone crawlers have become impractical when massive amounts of web data are required, so distributed crawling has become a prerequisite for obtaining data at that scale. At the same time, advances in web front-end technology, stronger anti-crawler measures, and increasing system complexity have introduced three problems: obtaining data from dynamic web pages, coping with anti-crawler mechanisms, and keeping the cluster stable. These problems seriously reduce the efficiency with which a crawler obtains target data (the data that users need). Building on a distributed web crawler, this thesis studies and designs solutions to these problems. The main contributions are as follows:

(1) Obtaining data from dynamic web pages: There are a large number of asynchronously loaded web pages on the internet whose content is not directly available to a web crawler. Existing program slicing methods [2] suffer from high complexity and low precision, and methods that invoke browser APIs [15][16] crawl data with low efficiency. This thesis proposes a rule-base method that classifies web pages so that data can be obtained from both dynamic and static pages, avoiding the high complexity and low precision of the former methods and improving crawling efficiency.

(2) Coping with anti-crawler mechanisms: Most companies protect the data on their websites with various technical measures that prevent crawlers from fetching it arbitrarily. Existing strategies [13][18] for dealing with anti-crawler mechanisms cannot cope with crawling a large amount of data from a small number of websites. This thesis designs an IP proxy pool strategy and a human-like crawling strategy, which bypass anti-crawler mechanisms by randomly changing the proxy IP and imitating the way people visit target sites.

(3) Stability of the cluster: When a complex cluster runs for a long time, nodes may stop working properly or drop out of the cluster because of network problems or faults in the cluster system itself. Existing strategies [15][48] for maintaining cluster stability have shortcomings such as limited functionality and lost tasks. This thesis designs heartbeat detection and missing-task recording strategies, which address abnormal node behavior, nodes going offline, and task loss by monitoring the messages sent by nodes and detecting missing tasks.

(4) Economic costs: The crawler system is based on a flexible, customizable open-source framework from which unnecessary functionality can be removed to produce a lightweight system. The lightweight system allows cluster nodes to run on Raspberry Pi boards (credit-card-sized computers running a Linux-based system), so a cluster can be built at lower cost. This makes the system more practical for research institutions with tight budgets and for small and medium-sized enterprises.

Finally, in a comparison experiment, the 40-node distributed crawler cluster designed in this thesis improves performance by at least 3 to 4 times and saves 30% of the cost compared with a stand-alone 40-thread reference crawler. For dynamic web pages, a comparison between the strategy designed in this thesis and a reference strategy [15] shows that time consumption is reduced by about 39%. For system stability and anti-crawler handling, a test experiment shows that the system crawls a large number of tasks successfully and is robust. Taken together, the experimental results show that the expected goals are achieved with good performance.
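The abstract does not give implementation details for the heartbeat detection and missing-task recording strategies, so the following is only a minimal sketch of the general idea, not the thesis's actual design. The class name `HeartbeatMonitor`, its fields, and the timeout value are all hypothetical: a master records the last heartbeat time of each node, and when a node stays silent longer than the timeout, the tasks it was holding are recorded as missing and put back in the pending queue.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical master-side monitor: tracks node heartbeats and
    re-queues tasks held by nodes that have gone silent."""
    timeout: float = 30.0  # assumed value: seconds of silence before a node counts as lost
    last_seen: dict = field(default_factory=dict)  # node id -> time of last heartbeat
    in_flight: dict = field(default_factory=dict)  # task id -> node id holding the task
    pending: list = field(default_factory=list)    # tasks waiting to be (re)assigned

    def heartbeat(self, node, now=None):
        # Record that a node reported in; `now` can be injected for testing.
        self.last_seen[node] = time.monotonic() if now is None else now

    def assign(self, task, node):
        # Remember which node is responsible for a task.
        self.in_flight[task] = node

    def complete(self, task):
        # A finished task no longer needs tracking.
        self.in_flight.pop(task, None)

    def sweep(self, now=None):
        # Find nodes whose last heartbeat is older than the timeout,
        # drop them, and move their unfinished tasks back to `pending`.
        now = time.monotonic() if now is None else now
        lost = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for node in lost:
            del self.last_seen[node]
        missing = [t for t, n in self.in_flight.items() if n in lost]
        for task in missing:
            del self.in_flight[task]
            self.pending.append(task)
        return lost, missing
```

In this sketch the master would call `sweep()` periodically; a real system would also need persistent task records and a way for a recovered node to rejoin, which the abstract does not describe.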
Keywords/Search Tags:Distributed web crawler, Obtaining data of dynamic web pages, Against web crawler