Research And Implementation Of Distributed Web Crawl Based On Hadoop Architecture

Posted on:2011-08-11

Degree:Master

Type:Thesis

Country:China

Candidate:J J Cheng

Full Text:PDF

GTID:2178360308461057

Subject:Computer Science and Technology

Abstract/Summary:

Cloud computing has become one of the most important technologies in the IT industry. Leading companies such as Google, IBM, Microsoft, and Amazon are actively developing their own cloud computing platforms. Against this background, the State Key Laboratory of Networking and Switching also needs to develop its own cloud computing platform, based mainly on Hadoop. The project described in this thesis is one part of that platform. Its goal is to develop a distributed search engine on Red Hat EL5.2, the distributed file system HDFS, and the distributed computing framework MapReduce. This thesis covers the crawler component of that distributed search engine.

The thesis first discusses the underlying technology in three parts: cloud computing, the Hadoop distributed platform, and the principles of web crawlers. The cloud computing part starts from the architecture of cloud computing and analyzes its service level and technical level. The author then presents the Hadoop distributed platform, which is the technical foundation of this work and comprises two core technologies, HDFS and MapReduce. The thesis also discusses the basic techniques of search engines and the basic principles of web crawlers, and analyzes Nutch as a prototype of a distributed search engine.

Based on this research, the thesis states the requirements of the project and presents a design for it, determining the layout of the distributed crawler system, its module division, and its processing flow. The author then designs and implements the corresponding data structures. Finally, the distributed crawler system is tested on large-scale clusters. From the test data, the author analyzes the advantages and disadvantages of the system and outlines a plan for future research.
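
The abstract does not say how the crawler maps onto MapReduce, so the following is only a minimal sketch of one plausible arrangement, not the author's implementation. It assumes a crawl round can be planned as a map step that keys each frontier URL by host, a shuffle that groups URLs per host, and a reduce step that removes duplicates and URLs already fetched. It runs in a single process rather than on Hadoop, and every function name (`map_phase`, `shuffle`, `reduce_phase`, `plan_crawl_round`) is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_phase(urls):
    """Map step: emit a (host, url) pair for every URL that has a host part."""
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host:  # skip strings that do not parse as absolute URLs
            yield host, url


def shuffle(pairs):
    """Shuffle step: group values by key, as a MapReduce framework
    would do between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(urls, already_fetched):
    """Reduce step: drop duplicates and URLs fetched in earlier rounds."""
    return sorted(set(urls) - already_fetched)


def plan_crawl_round(frontier, already_fetched):
    """Return a {host: [urls to fetch]} plan for one crawl round."""
    plan = {}
    for host, urls in shuffle(map_phase(frontier)).items():
        todo = reduce_phase(urls, already_fetched)
        if todo:  # omit hosts with nothing left to fetch
            plan[host] = todo
    return plan


if __name__ == "__main__":
    # Hypothetical frontier and fetch history, for illustration only.
    frontier = [
        "http://a.example/1",
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/x",
    ]
    print(plan_crawl_round(frontier, {"http://b.example/x"}))
```

Grouping by host is one common design choice, since it lets a single worker handle all requests to a given site; whether the thesis partitions work this way is not stated in the abstract.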

Keywords/Search Tags:

cloud computing, distributed search engine, crawler, Hadoop, HDFS, MapReduce

Related items

1	Key Technology Study On The Cloud Computing Platform In The Field Of Search Engine
2	Research And Implementation Of Distributed Web Crawler
3	Study Based On Hadoop Distributed Web Crawler
4	The Research And Application Of Search Engine Based On Hadoop
5	Research And Application Of The Characteristics Of Distributed Computing Of OSS/BSS In The Cloud Deployment
6	The Research On Web Crawler Technology Based On Distributed Calculation
7	The Design Of The Cloud Computing System Based On Hadoop
8	Design And Implementation Of Vertical Search Engine Based On Hadoop
9	Optimization And Application Research Of MapReduce Computing Model Based On Hadoop
10	The Cloud Computing Based On Hadoop Platform And Log Analysis