
The Research Of Performance Optimization Of Hadoop In Big Data

Posted on: 2014-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y Cao
GTID: 2248330398451961
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet and Web technologies, widely used data such as audio, video, Web logs, Internet search indexes, and text files have caused a sharp increase in data volume, signaling the arrival of the big data era. Data in this era is characterized by rapidly growing volume and increasingly complex structure, which makes storage and processing more difficult. The emergence of Hadoop greatly simplifies data storage and processing in the big data era, so research on Hadoop and its optimization has important practical significance. The main work of this paper is as follows. First, the theory of HDFS and MapReduce, the core technologies of Hadoop, is studied and analyzed. The study examines the NameNode, DataNode, interfaces, classes, and call relationships in detail and analyzes the working mechanisms of HDFS and MapReduce. Based on an in-depth study of the source code, the paper then targets two existing performance problems in Hadoop and proposes preliminary improvements. Second, the paper studies and analyzes the first performance problem: the poor performance of Hadoop's speculative execution algorithm in heterogeneous environments. An improved algorithm is proposed. It automatically adjusts the execution of backup tasks according to the system load so that the load stays balanced, and it identifies stragglers more precisely by placing in a queue those tasks whose residual time value, which is based on the historical average completion time put forward by Zaharia, is greater than 0.2. To a certain extent, the new algorithm improves the performance of speculative execution in heterogeneous environments.
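The straggler selection described above can be sketched roughly as follows. This is only an illustration, not the thesis's algorithm: the abstract gives the 0.2 threshold but not how the residual time value is normalised, so the sketch assumes it is the estimated remaining time divided by the historical average completion time. The load rule (fewer backups when fewer slots are idle) and all names (`Task`, `residual_time`, `pick_backups`) are likewise hypothetical.

```python
from dataclasses import dataclass

# Threshold quoted in the abstract; how the value is normalised is an assumption.
RESIDUAL_THRESHOLD = 0.2


@dataclass
class Task:
    task_id: str
    progress: float  # fraction of work completed, between 0 and 1
    elapsed: float   # seconds since the task started


def residual_time(task):
    """Estimate remaining time as remaining work divided by observed progress rate."""
    rate = task.progress / task.elapsed
    return (1.0 - task.progress) / rate


def pick_backups(tasks, avg_completion, free_slots, total_slots):
    """Return ids of tasks to back up, slowest first, capped by current load."""
    # Assumed normalisation: remaining time relative to the historical average.
    scored = [
        (residual_time(t) / avg_completion, t.task_id)
        for t in tasks
        if t.progress > 0 and t.elapsed > 0
    ]
    # Queue only tasks above the threshold, largest residual value first.
    queue = sorted((s for s in scored if s[0] > RESIDUAL_THRESHOLD), reverse=True)
    # Assumed load rule: the busier the cluster, the fewer backups are launched.
    cap = int(len(tasks) * free_slots / total_slots)
    return [task_id for _, task_id in queue[:cap]]
```

Under these assumptions, a task that is nearly finished is never queued, and a fully loaded cluster (`free_slots == 0`) launches no backup tasks at all.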
Finally, the second performance problem is the poor performance of DBInputFormat when it processes huge amounts of data in a relational database. To solve this problem, the paper improves the DBInputFormat interface, proposes a new sharding strategy, and builds the improved interface, which raises the efficiency and performance of Hadoop when processing relational databases. An experimental platform was built to test the proposed algorithm and the improved interface; the experiments verify that both can, to some extent, improve the performance of Hadoop.
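The abstract does not say what the new sharding strategy is. As a hedged illustration of one common approach, the sketch below splits a table by contiguous ranges of a numeric key, so that each split reads only its own rows instead of paging through the whole table. It should not be read as the thesis's method; the function names, table name, and key column are all hypothetical.

```python
def key_range_splits(min_key, max_key, num_splits):
    """Divide the inclusive key range [min_key, max_key] into half-open ranges [lo, hi)."""
    total = max_key - min_key + 1
    # Never create more splits than there are key values.
    num_splits = max(1, min(num_splits, total))
    base, extra = divmod(total, num_splits)
    splits, lo = [], min_key
    for i in range(num_splits):
        # Spread the remainder over the first `extra` splits.
        size = base + (1 if i < extra else 0)
        splits.append((lo, lo + size))
        lo += size
    return splits


def split_query(table, key, lo, hi):
    """Build the query one split would run; table and key names are illustrative."""
    return f"SELECT * FROM {table} WHERE {key} >= {lo} AND {key} < {hi}"
```

For example, `key_range_splits(1, 10, 3)` yields three ranges that together cover keys 1 through 10 without overlap, and each range can be handed to a separate map task via `split_query`.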
Keywords/Search Tags: Hadoop, MapReduce, Hadoop Speculative Execution, DBInputFormat