The Research Of Performance Optimization Of Hadoop In Big Data

Posted on:2014-02-16

Degree:Master

Type:Thesis

Country:China

Candidate:Y Cao

Full Text:PDF

GTID:2248330398451961

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of Internet and Web technology, widely used data sources such as audio, video, Web logs, Internet search indexes, and Internet text files have sharply increased the amount of data, which indicates that the era of big data is arriving. In this era, data volumes grow sharply and data structures become more complicated, which makes data storage and processing more difficult. The emergence of Hadoop greatly simplifies data storage and processing in the era of big data, so research on Hadoop and its optimization has important practical significance.

The main research in this paper is as follows. First, the theory of HDFS and MapReduce, the core technologies of Hadoop, is studied and analyzed. The study considers the following aspects in detail: NameNode, DataNode, interfaces, classes, and call relationships, and it analyzes the working mechanisms of HDFS and MapReduce. Based on this in-depth study of the source code, the paper targets two performance problems that currently exist in Hadoop and puts forward preliminary improvement schemes.

Second, the paper studies and analyzes the first performance problem: the poor performance of the Hadoop speculative execution algorithm in heterogeneous environments. A new, improved algorithm is put forward. It automatically adjusts the execution of backup tasks according to the system load so that the load stays balanced, and it identifies stragglers more precisely by placing in a queue the tasks whose residual time value, which is based on the historical average completion time put forward by Zaharia, is greater than 0.2. To a certain extent, the new algorithm improves the performance of speculative execution in heterogeneous environments.
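The straggler-selection idea can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how the residual time value is normalized, so the code assumes it is the estimated time left divided by the historical average completion time. The time-left estimate follows the heuristic from Zaharia et al.'s LATE scheduler (remaining work divided by observed progress rate). The names `TaskStatus`, `pick_backup_candidates`, and `free_slots` are hypothetical, and `free_slots` only stands in for whatever load-based limit the thesis applies to backup tasks.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    task_id: str
    progress: float  # fraction of work done, in [0, 1]
    elapsed: float   # seconds since the task started


def estimated_time_left(task):
    """LATE-style estimate: remaining work divided by observed progress rate."""
    rate = task.progress / task.elapsed
    return (1.0 - task.progress) / rate


def pick_backup_candidates(tasks, avg_completion_time, free_slots, threshold=0.2):
    """Return ids of tasks to back up, longest estimated remaining time first.

    Assumption: the residual time value is the estimated time left divided
    by the historical average completion time of finished tasks.
    Assumption: free_slots models the load-based cap on backup tasks.
    """
    queue = []
    for task in tasks:
        if task.progress <= 0.0 or task.elapsed <= 0.0:
            continue  # no progress observed yet, so no usable estimate
        residual = estimated_time_left(task) / avg_completion_time
        if residual > threshold:
            queue.append((residual, task.task_id))
    queue.sort(reverse=True)  # slowest-looking tasks come first
    return [task_id for _, task_id in queue[:max(free_slots, 0)]]
```

With a busy cluster (`free_slots` near zero) the sketch launches few or no backups, and with an idle one it backs up more of the queued stragglers, which is one simple way to tie speculation to system load.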
Finally, the second performance problem is that DBInputFormat performs poorly when it processes huge amounts of data in a relational database. To solve this problem, the DBInputFormat interface is improved: a new sharding strategy is put forward and the improved interface is built on it, which raises the efficiency and performance of Hadoop when processing relational databases.

An experimental platform is built to test the proposed algorithm and the improved interface. The experiments verify that both can, to some extent, improve the performance of Hadoop.
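The abstract does not describe the new sharding strategy, so the following is only a sketch of one common alternative, not the thesis's method. The stock DBInputFormat pages through the result set with LIMIT and OFFSET clauses, which tends to get slower as the offset grows. The sketch instead assumes an indexed numeric key and gives each map task a bounded key range. The function names and the `orders`/`id` table and column are hypothetical.

```python
def range_splits(min_id, max_id, num_splits):
    """Divide the inclusive key range [min_id, max_id] into half-open ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if max_id < min_id:
        return []  # empty table: nothing to split
    total = max_id - min_id + 1
    num_splits = min(num_splits, total)  # never create empty splits
    base, extra = divmod(total, num_splits)
    splits = []
    lo = min_id
    for i in range(num_splits):
        # The first `extra` splits take one additional key each.
        hi = lo + base + (1 if i < extra else 0)
        splits.append((lo, hi))
        lo = hi
    return splits


def split_query(table, key, lo, hi):
    """Build the bounded query one map task would run for its range."""
    return f"SELECT * FROM {table} WHERE {key} >= {lo} AND {key} < {hi}"


# Example: three map tasks over keys 1..10 of a hypothetical table.
for lo, hi in range_splits(1, 10, 3):
    print(split_query("orders", "id", lo, hi))
```

Because each query is bounded by the key rather than by a row offset, the database can use the index to jump to the start of a range. The trade-off is that skewed or sparse keys produce uneven splits.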

Keywords/Search Tags:

Hadoop, MapReduce, Hadoop Speculative Execution, DBInputFormat

Related items

1	The Optimization Of Scheduling Algorithm And Download Hadoop Platform Mechanism
2	Research On The Performance And Optimization Of MapReduce Model In Hadoop Platform
3	Research On Scheduling Algorithm In Hadoop MapReduce
4	Research On Improving The Fault Tolerance Performance In MapReduce
5	MapReduce Speculation Execution Algorithm In Heterogeneous Environments
6	Research On Task Scheduling Algorithms Based On Pre-Release Resource List In Hadoop
7	The Mapreduce Model In The Hadoop Implementation Of Performance Analysis And Optimization Improvements
8	The Research Of MapReduce Job Scheduling Algorithm Based On The Hadoop Platform
9	Design Of Mapreduce Task Scheduling Algorithms In Heterogeneous Hadoop Cluster
10	The Performance Optimization And Improvement Of MapReduce In Hadoop