Research And Implementation Of Distributed Web Crawl Based On Hadoop Architecture

Posted on: 2011-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: J J Cheng
Full Text: PDF
GTID: 2178360308461057
Subject: Computer Science and Technology
Abstract/Summary:
Cloud computing has become one of the most important technologies in the IT industry. Leading companies such as Google, IBM, Microsoft, and Amazon are actively developing their own cloud computing platforms. Against this background, the State Key Laboratory of Networking and Switching also needs to develop its own cloud computing platform, which is mainly based on Hadoop. The project described in this paper is one part of that platform. Its goal is to develop a distributed search engine based on Redhat EL5.2, the distributed file system HDFS, and the distributed computing framework MapReduce. This paper explores the crawler component of this distributed search engine.

The paper first discusses the underlying technologies of the crawler in three main parts: "Cloud Computing", "Hadoop Distributed Platform", and "the Principle of Web Crawler". In the "Cloud Computing" part, the author starts from the architecture of cloud computing and analyzes its service level and technical level. The author then presents the Hadoop distributed platform, which is the basic technical background of this paper; Hadoop includes two core technologies, HDFS and MapReduce. The paper also discusses the basic techniques of search engines and the basic principles of web crawlers, and analyzes the distributed search engine prototype Nutch.

Based on this research, the paper sets out the requirements of the project and presents a clear design for it, determining the layout of the distributed crawler system, its module division, and its processing flow. On this basis, the author designs the data structures and implements them. Finally, the distributed crawler system is tested on large-scale clusters. From the test data, the author analyzes the advantages and disadvantages of the system and outlines a plan for future research.
Keywords/Search Tags:cloud computing, distributed search engine, crawler, Hadoop, HDFS, MapReduce