| In recent years, the number of indexed web pages has grown exponentially. In this era of big data, the limited computing and storage capacity of a stand-alone crawler falls far short of what large-scale crawling requires, and the rise of distributed platforms offers a way out of this problem. Combining a web crawler system with a distributed platform can address the difficulties of crawling and storing web pages at large scale, so research on crawler systems built on the Hadoop platform is valuable. This paper analyzes in depth two key algorithms in distributed crawling, the task scheduling algorithm and the URL deduplication algorithm, identifies their flaws, and then improves and optimizes them on the distributed platform. Task scheduling is critical in a distributed crawler: if tasks are not allocated properly, the crawl efficiency of the cluster drops severely. Chapter 3 analyzes the relatively efficient weighted round-robin task scheduling algorithm and proposes a dynamic weighted round-robin task scheduling algorithm with feedback, which overcomes the negative impact of fixed weights in weighted round-robin scheduling so that the system achieves good load balancing. URL deduplication is an algorithm that severely constrains crawl efficiency: if it performs poorly, the crawler fetches the same pages repeatedly and may even fall into an infinite loop. Deduplication based on the Bloom Filter does not store the elements themselves, which saves a great deal of storage space, something particularly important in today's big data setting; its insertion and query complexity is very low, and its underlying bit-array data structure is easy to implement. However, it suffers from false positives. Chapter 4 analyzes the standard Bloom Filter, proposes the MBF filter, and applies it to the Hadoop distributed crawler system, retaining the advantages of the Bloom Filter while effectively reducing the false positives it causes. Finally, based on these two improved algorithms, a distributed web crawler system is designed and implemented, covering requirement analysis, process analysis, the outline design of the system (including physical and logical framework design), module design, and the data storage structure. The distributed crawler system is then tested. |
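
To make the Bloom Filter properties mentioned in the abstract concrete (no stored elements, a plain bit array, cheap insert and query, possible false positives), the following is a minimal sketch of a standard Bloom Filter used for URL deduplication. It is not the MBF filter proposed in the paper, whose details the abstract does not give. The class name `BloomFilter`, the helper `should_crawl`, the default sizes, and the use of SHA-256 with double hashing are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Standard Bloom Filter sketch (illustrative; not the paper's MBF)."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        # Both sizes are assumed defaults, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # Only a bit array is kept; the URLs themselves are never stored.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: h1 + i * h2 (mod num_bits).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url):
        # Insertion sets num_hashes bits.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


def should_crawl(seen, url):
    """Hypothetical helper: return True only the first time a URL is offered."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

A crawler would call `should_crawl(seen, url)` before queueing each discovered link. Because a false positive makes the filter report an unseen URL as already visited, that page would be skipped, which is the misjudgment problem the abstract says the MBF filter is meant to reduce.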