Research On Optimization Of Hadoop Distributed Web Crawler System

Posted on:2018-06-13

Degree:Master

Type:Thesis

Country:China

Candidate:T Zhang

Full Text:PDF

GTID:2348330563952650

Subject:Software engineering

Abstract/Summary:

In recent years, the number of indexed web pages has grown exponentially. In this big-data era, a stand-alone crawler, with its limited computing and storage capacity, can no longer keep up with crawling demands, and the rise of distributed platforms offers a way out. Combining a web crawler with a distributed platform addresses the difficulties of large-scale web page crawling and storage, so research on a crawler system built on the Hadoop platform is valuable. This thesis analyzes in depth two key algorithms in distributed crawling (the task scheduling algorithm and the URL deduplication algorithm), identifies their flaws, and then improves and optimizes them on the distributed platform.

Task scheduling is critical in a distributed crawler: if tasks are allocated poorly, the crawl efficiency of the cluster drops severely. Chapter 3 analyzes the relatively efficient weighted round-robin task scheduling algorithm and proposes a dynamic weighted round-robin task scheduling algorithm with feedback, which overcomes the negative impact of the fixed weights in weighted round-robin scheduling so that the system achieves good load balancing.

URL deduplication seriously constrains crawling efficiency: with a poor algorithm, the crawler fetches the same page repeatedly and may even fall into an infinite loop. Deduplication based on a Bloom Filter does not store the elements themselves, which saves a great deal of storage space, something particularly important with today's data volumes. Inserting and querying elements have very low complexity, and the underlying bit array is easy to implement. Its drawback is misjudgment (false positives). Chapter 4 analyzes the standard Bloom Filter, proposes the MBF filter, and applies it to the Hadoop distributed crawler system, retaining the advantages of the Bloom Filter while effectively reducing the misjudgments it causes.

Finally, based on the two key algorithms improved in Chapters 3 and 4, the distributed web crawler system is designed and implemented. The work covers requirements analysis, process analysis, the outline design of the system (including physical framework design and logical framework design), module design, and the data storage structure. The distributed crawler system is then tested.
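The abstract does not give the details of the dynamic weighted round-robin algorithm with feedback, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes each crawler node reports a load fraction between 0 and 1, and that a node's effective weight is its base weight scaled by `1 - load`; the names `Node`, `FeedbackWeightedRoundRobin`, `report_load`, and `next_node` are hypothetical. Selection uses the smooth weighted round-robin technique.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    base_weight: int   # static capacity estimate (hypothetical)
    weight: int = 0    # effective weight, adjusted by feedback
    current: int = 0   # running counter used by smooth round-robin

    def __post_init__(self):
        # Start with the effective weight equal to the static weight.
        self.weight = self.base_weight


class FeedbackWeightedRoundRobin:
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def report_load(self, name, load):
        """Feedback step: `load` is a fraction in [0, 1] reported by a node.

        Assumed rule (not from the thesis): weight = base_weight * (1 - load),
        rounded, and never below 1 so the node is not starved.
        """
        load = min(max(load, 0.0), 1.0)
        for node in self.nodes:
            if node.name == name:
                node.weight = max(1, round(node.base_weight * (1.0 - load)))
                return
        raise KeyError(name)

    def next_node(self):
        """Pick the next node with smooth weighted round-robin."""
        total = sum(n.weight for n in self.nodes)
        for n in self.nodes:
            n.current += n.weight
        chosen = max(self.nodes, key=lambda n: n.current)
        chosen.current -= total
        return chosen.name


if __name__ == "__main__":
    scheduler = FeedbackWeightedRoundRobin([Node("A", 3), Node("B", 1)])
    print([scheduler.next_node() for _ in range(8)])
    scheduler.report_load("A", 0.75)   # node A reports that it is busy
    print([scheduler.next_node() for _ in range(8)])
```

With fixed weights, node A would keep receiving three tasks for every one sent to node B regardless of how busy it is; the feedback step lowers A's share once it reports high load, which is the load-balancing effect the abstract refers to.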
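To illustrate the URL deduplication step, here is a minimal sketch of a standard Bloom Filter, not the MBF filter proposed in the thesis, whose design the abstract does not describe. The class and function names, the use of SHA-256 with double hashing to derive the bit positions, and the sizes in the usage example are all assumptions made for illustration. The helper `expected_false_positive_rate` uses the usual approximation p ≈ (1 - e^(-kn/m))^k for m bits, k hash functions, and n inserted elements.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bit array, all zeros

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def expected_false_positive_rate(m, k, n):
    """Approximate misjudgment rate after n insertions."""
    return (1.0 - math.exp(-k * n / m)) ** k


def should_crawl(seen, url):
    """Return True the first time a URL is offered, False afterwards."""
    if url in seen:
        return False
    seen.add(url)
    return True


if __name__ == "__main__":
    seen = BloomFilter(num_bits=1024, num_hashes=3)
    print(should_crawl(seen, "http://example.com/a"))
    print(should_crawl(seen, "http://example.com/a"))
    print(expected_false_positive_rate(1024, 3, 100))
```

Only bits are stored, never the URLs themselves, which is where the space saving comes from. The cost is that `should_crawl` can occasionally skip a page that was never fetched, which is the misjudgment problem the thesis sets out to reduce.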

Keywords/Search Tags:

Hadoop, Web Crawler, Task scheduling, Eliminate URL duplication

Related items

1	Duplication-based Scheduling Algorithm For Parallel Tasks On Heterogeneous Cluster
2	Task-duplication And Insertion Based Scheduling Algorithm For Heterogeneous Computing Environments
3	Researches On Task Placement And Task Duplication Methods Based On Reconfigurable Systems
4	Research Of Multi-core Processor Scheduling Algorithm Based On Task Clustering And Duplication
5	Research On Task Duplication Based Multi-core Scheduling Algorithm
6	QoS Cloud Workflow Scheduling Algorithm Based On Critical Task Duplication
7	Duplication And Dynamic-Priority Based Scheduling Algorithm In Grid Computing Systems
8	Research On Task Scheduling Algorithms Based On Pre-Release Resource List In Hadoop
9	Research Of Task Scheduling Algorithms For Heterogeneous Computing Environment
10	Study On Computing Task Scheduling Optimization Based On Hadoop Job