Hadoop Performance Optimization Based On Parameter Tuning

Posted on: 2023-10-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Luo
GTID: 1528306851989499
Subject: Agricultural IT
Abstract/Summary:
The emergence of cloud computing has brought an unprecedented technological revolution to the information industry. Hadoop integrates many elements of cloud technology into a comprehensive cloud computing service platform that is scalable, fault tolerant and efficient. As the range of Hadoop applications expands, the performance demands placed on Hadoop grow with it. Building on analyses of the Hadoop implementation framework and working mechanism, researchers have proposed performance optimizations targeting data storage strategy, scheduling algorithms, parameter configuration and other aspects. This thesis studies optimization strategies from two of these aspects, configuration parameters and scheduling algorithms, to improve the overall performance of Hadoop clusters.

Hadoop consists of MapReduce and HDFS, and the parameter configuration of these two components strongly affects cluster performance. Many types of jobs run on the Hadoop platform, and each needs a parameter configuration suited to its actual workload to run well. This thesis focuses on how HDFS and MapReduce configuration parameters affect jobs with high I/O load and jobs with high CPU load, and proposes corresponding optimization methods.

To solve the HDFS parameter optimization problem, a genetic simulated annealing algorithm is proposed. It combines simulated annealing with a genetic algorithm so that the genetic search can find the optimal solution in the global space rather than stalling at a local one. The genetic simulated annealing algorithm is used to select appropriate configuration parameters for high I/O load jobs and to speed up their execution. Experiments show that the proposed HDFS parameter optimization method effectively reduces job completion time and lays a foundation for the efficient execution of high I/O load jobs in Hadoop clusters.

For MapReduce parameter optimization, the thesis addresses the poor performance of the traditional PCM algorithm and its sensitivity to the initial matrix values by improving the particle swarm initialization method, yielding the PSO-PCM algorithm. An evolutionary state decision method is used to optimize the fitness value, and a two-particle-swarm optimization algorithm is proposed. Changing the value of the cluster centers after each iteration generates multiple clustering results. Finally, the PSO-PCM algorithm is applied to MapReduce parameter optimization. Experiments show that the MapReduce parameter optimization method based on PSO-PCM improves the execution efficiency of high CPU load jobs in Hadoop clusters.

Many studies have shown that, for Hadoop job scheduling, the bee colony algorithm has clear advantages over the traditional FIFO, fair and capacity scheduling strategies and effectively addresses the problem of task scheduling time in cloud computing systems. Because the bee colony algorithm is prone to premature convergence and converges slowly, this thesis combines it with the k-means algorithm to form an improved algorithm, the K-means bee colony algorithm. Experiments show that job scheduling with the K-means bee colony algorithm has clear advantages in reducing job completion time and balancing load.

To improve the task timeout rate, system energy consumption and adaptive ability of the centralized batch scheduling model in Hadoop task scheduling, the thesis proposes a Hadoop task scheduling model based on a hierarchical load balancing algorithm. The scheduling system is divided into multiple scheduling layers according to the function type and performance of the servers. The simplest scheduling strategy is then used to optimize system energy consumption and reduce scheduling time as much as possible. Experiments show that, compared with a Hadoop task scheduling model based on simulated annealing, the model based on hierarchical load balancing performs better in load balancing, task timeout rate, system energy consumption and adaptive ability.
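The idea of combining a genetic algorithm with simulated annealing for parameter selection can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the parameter names and value ranges in `SPACE`, the `toy_cost` function standing in for a measured job completion time, and all hyperparameters (`pop_size`, `t0`, `cooling`) are assumptions made purely for illustration. The sketch shows one common way to hybridize the two methods, where each child produced by crossover and mutation replaces its parent according to the Metropolis acceptance rule.

```python
import math
import random

# Hypothetical search space. These names and values are illustrative
# stand-ins, not the HDFS parameters tuned in the thesis.
SPACE = {
    "block_size_mb": [64, 128, 256, 512],
    "replication": [1, 2, 3],
    "io_buffer_kb": [4, 16, 64, 128],
}


def toy_cost(cfg):
    """Assumed stand-in for a measured job completion time (lower is better)."""
    return (abs(cfg["block_size_mb"] - 256) / 64
            + abs(cfg["replication"] - 2)
            + abs(cfg["io_buffer_kb"] - 64) / 32)


def random_config(rng):
    """Draw one configuration uniformly from the search space."""
    return {name: rng.choice(values) for name, values in SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter comes from either parent."""
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in SPACE}


def mutate(cfg, rng, rate=0.3):
    """Resample each parameter with probability `rate`."""
    return {name: (rng.choice(SPACE[name]) if rng.random() < rate else value)
            for name, value in cfg.items()}


def genetic_annealing(cost, pop_size=8, generations=30, t0=5.0, cooling=0.9, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    best = min(population, key=cost)
    temperature = t0
    for _ in range(generations):
        next_population = []
        for parent in population:
            mate = rng.choice(population)
            child = mutate(crossover(parent, mate, rng), rng)
            delta = cost(child) - cost(parent)
            # Metropolis rule: always keep an improvement; keep a worse
            # child with probability exp(-delta / T) to escape local optima.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                next_population.append(child)
            else:
                next_population.append(parent)
        population = next_population
        # Track the best configuration seen so far across generations.
        best = min(population + [best], key=cost)
        temperature *= cooling  # geometric cooling schedule
    return best, cost(best)


if __name__ == "__main__":
    config, value = genetic_annealing(toy_cost)
    print(config, value)
```

In a real tuning loop, `toy_cost` would be replaced by running a benchmark job under the candidate configuration and measuring its completion time, which makes each evaluation expensive and keeps practical population sizes and generation counts small.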
Keywords/Search Tags: Hadoop performance optimization, Optimization of HDFS parameters based on genetic simulated annealing algorithm, MapReduce parameter optimization based on PSO-PCM algorithm, K-means bee colony algorithm