
Research On Performance Optimization Of MapReduce Jobs Under Erasure Coding Schemes

Posted on: 2023-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Yang
GTID: 2558306905993929
Subject: Computer software and theory
Abstract/Summary:
With the explosive growth of Internet data, many distributed storage systems have integrated erasure coding mechanisms to ensure data reliability while further reducing storage overhead. However, erasure coding changes how data is organized in the storage system, which affects data access by other services in the cluster. Take the striped erasure coding strategy implemented in Hadoop as an example. When a stripe is encoded, the original data is divided into finer-grained units than under the replication scheme, and these units are dispersed across all nodes that hold the stripe. When an upper-layer application such as a MapReduce job accesses a contiguous piece of the original data, the task must read data from multiple nodes. Furthermore, when Hadoop is deployed in a heterogeneous cluster, or when the workload varies greatly across nodes, this "one-to-many" data access mode may degrade task performance because of transmission delays on straggler nodes, which increases the completion time of MapReduce jobs. In addition, the heterogeneity of the cluster and differences in the runtime load of the nodes also cause the tasks within a MapReduce job to run at different speeds.

Therefore, to improve service performance under erasure coding for upper-layer MapReduce applications, this paper proposes a MapReduce job optimization strategy for heterogeneous clusters. The core idea of the strategy is to modify data placement on the storage side, and task allocation and job scheduling on the computing side, separately. In this way, we reduce the negative impact of the heterogeneous environment on data access performance under erasure coding, and at the same time avoid amplifying load differences among nodes through static task allocation. The main contributions are as follows:

(1) Job-friendly data placement under erasure coding schemes. Unlike the default random data placement strategy in Hadoop, we propose a data placement strategy based on node hardware information and long-term background load, named Heterogeneous-aware Data Placement Algorithm (HDPA). By analyzing the data access performance of each node in the cluster, HDPA divides nodes with similar access performance into the same group. When data is written to Hadoop with an erasure coding scheme, HDPA places the data blocks of the same stripe on a set of nodes with similar performance, so as to avoid straggler nodes when the stripe is accessed and to improve the access speed of erasure-coded data for MapReduce jobs.

(2) Load-balanced task allocation in heterogeneous environments. To avoid the load skew caused by Hadoop's default random task allocation, this paper proposes a task allocation strategy for general heterogeneous clusters, named Dynamic Task Allocation Algorithm (DTAA). By introducing a performance analysis module into each computing node, DTAA dynamically adjusts the number of computing units available to MapReduce jobs on that node, and thereby controls the overall task concurrency of the cluster. In this way, we avoid the severe long-tail tasks that arise from high load on some nodes and hurt the performance of MapReduce jobs.

(3) Load-balanced job scheduling in multi-job concurrent scenarios. The existing scheduling policies in Hadoop do not take into account the load balance of the underlying data access, and their static resource partitioning does not adapt well to the heterogeneous environment of the cluster or to changing load. In this paper, we propose a heterogeneous-aware fair job scheduling strategy named Dynamic Balanced Fair Scheduling Algorithm (DB-Fair). In the scheduler, DB-Fair analyzes the storage information of each job's dataset under the HDPA group placement strategy, together with the available computing units of each node under the DTAA task allocation strategy, and dynamically determines the number and priority of tasks that each job can run in the cluster. In this way, DB-Fair balances the access load among storage nodes and the hardware resource occupancy among computing nodes even when data is stored under a variety of erasure coding modes, and thus improves job execution efficiency in multi-job concurrent scenarios.

We deployed the proposed scheme in a real heterogeneous cluster. The experimental results show that it improves job execution efficiency by 10% to 40% across different types of MapReduce jobs. In the multi-job concurrent scenario, it shortens the completion time of all jobs by about 17%.
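
The reasoning behind grouped placement can be illustrated with a minimal sketch. It is not the thesis's HDPA implementation: the node names, the throughput figures, the fixed group size, and the helper names `group_by_performance` and `stripe_read_time` are all assumptions made for illustration. The sketch models only one point from the abstract, namely that a striped read fetches units from several nodes in parallel, so the slowest node in the stripe bounds the read time.

```python
from typing import Dict, List


def group_by_performance(throughput_mbps: Dict[str, float],
                         group_size: int) -> List[List[str]]:
    """Rank nodes by measured throughput and cut the ranking into
    consecutive groups, so each group holds nodes of similar speed.
    (Hypothetical grouping rule, not the one used in the thesis.)"""
    ranked = sorted(throughput_mbps, key=throughput_mbps.get, reverse=True)
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]


def stripe_read_time(nodes: List[str],
                     throughput_mbps: Dict[str, float],
                     unit_mb: float = 1.0) -> float:
    """One unit is read from every node of the stripe in parallel,
    so the slowest node determines the completion time (seconds)."""
    return max(unit_mb / throughput_mbps[n] for n in nodes)


# Hypothetical per-node read throughput in MB/s.
throughput = {"n1": 100.0, "n2": 95.0, "n3": 90.0,
              "n4": 25.0, "n5": 20.0, "n6": 20.0}

groups = group_by_performance(throughput, group_size=3)
mixed_stripe = ["n1", "n2", "n4"]   # a placement that mixes fast and slow nodes
fast_stripe = groups[0]             # a placement inside the fastest group

print("groups:", groups)
print("mixed stripe read time:", stripe_read_time(mixed_stripe, throughput))
print("fast-group stripe read time:", stripe_read_time(fast_stripe, throughput))
```

In this toy model, grouping does not make slow nodes faster. It only keeps a single slow node from bounding the read time of a stripe whose other units sit on fast nodes.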
Keywords/Search Tags:Distributed System, Erasure Coding, Data Placement, MapReduce, Job Scheduling, Heterogeneous Cluster