Research On Replicas Placement And Cache Optimization Of HDFS

Posted on:2019-04-28

Degree:Master

Type:Thesis

Country:China

Candidate:G Chen

Full Text:PDF

GTID:2428330572492943

Subject:Information and Communication Engineering

Abstract/Summary:

With the rapid development of Internet technology and applications, the amount of data in networks has exploded. As a result, large-scale data storage has become a research hotspot. At present, HDFS (Hadoop Distributed File System) is widely used in big data fields because of its efficient and stable storage capacity. However, HDFS still has many shortcomings and needs continuous improvement. Firstly, HDFS does not consider the storage heterogeneity of datanodes, and its underlying storage supports only a single storage medium, so HDFS clusters cannot make good use of efficient storage devices (e.g. RAM disks, solid-state drives) to improve I/O performance and throughput. Secondly, HDFS cannot distinguish hot data, so the storage nodes that hold hot data become a performance bottleneck of the cluster. Finally, when the cluster contains a large number of hot small files, HDFS has no good replacement policy for caching them. Thus, when accessing hot small files, HDFS must constantly interact with the HDD, which greatly reduces the access efficiency of small files. To mitigate these problems, this paper studies and improves the replica placement and caching mechanisms of HDFS. The main work of this paper includes the following aspects:

(1) To address the problem that HDFS cannot effectively use RAM disks and SSDs, this paper proposes a Heat Perception based Adaptive Move Policy of replicas for hybrid HDFS (HPAMP), which benefits from the storage heterogeneity of datanodes. Specifically, HPAMP places replicas on efficient storage (RAM disks or SSDs) based on file size; when the cluster is idle, HPAMP uses the gray prediction algorithm to predict the warmth of replicas, and then moves hot replicas from the HDD to a RAM disk or SSD, or moves cold replicas from the RAM disk or SSD to the HDD. In addition, the number of replicas moved is adaptively adjusted based on space utilization. The experimental results show that HPAMP is 3.4 times, 1.89 times and 1.68 times faster than the Default Policy of HDFS (DP), the Round Robin Policy (RRP) and the Tier Aware Policy (TAP), respectively, in the TeraGen benchmark; in the Sort benchmark, HPAMP saves run time by 46.2%, 29.2% and 21.3% over DP, RRP and TAP, respectively.

(2) To address the large amount of hot data among the massive number of small files in HDFS, this paper proposes an HDFS small-file cache management method based on the ARC (Adaptive Replacement Cache) replacement algorithm. This method treats hot files as the candidates for caching. The ARC algorithm can accurately predict which small files will be accessed frequently, adds them to the cache, and dynamically replaces the data in the cache. This cache management greatly reduces how often HDFS must access disks for hot small files and improves the access efficiency of the cluster. The experimental results show that, compared with FIFO, LRU and LFU, the ARC algorithm achieves the highest cache hit rate and saves run time by 14.2%, 6.1% and 3%, respectively.
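The abstract does not say which gray model HPAMP uses to predict replica warmth. As an illustration only, the sketch below assumes the common GM(1,1) model applied to a short history of per-period access counts. The function names `predict_next_heat` and `is_hot`, the access-count input, and the threshold are hypothetical and are not taken from the thesis.

```python
import math
from typing import Sequence


def predict_next_heat(history: Sequence[float]) -> float:
    """One-step-ahead GM(1,1) forecast for a non-negative series.

    `history` is assumed to hold access counts of one replica over
    consecutive periods (an assumption; the thesis may use other inputs).
    """
    n = len(history)
    if n < 4:
        raise ValueError("GM(1,1) is usually fitted on at least 4 samples")

    # 1) Accumulated generating operation: x1[k] = history[0] + ... + history[k].
    x1 = []
    total = 0.0
    for value in history:
        total += value
        x1.append(total)

    # 2) Background values: mean of consecutive accumulated values.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = list(history[1:])

    # 3) Ordinary least squares for y = -a * z + b
    #    (a: development coefficient, b: gray input).
    m = len(z)
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zv * yv for zv, yv in zip(z, y))
    denom = m * szz - sz * sz
    if denom == 0:
        # Degenerate series (e.g. all zeros): fall back to the last value.
        return float(history[-1])
    slope = (m * szy - sz * sy) / denom
    a = -slope
    b = (sy - slope * sz) / m

    # 4) With a close to 0 the series is flat and the forecast is simply b.
    if abs(a) < 1e-12:
        return b

    # 5) Time response of the accumulated series, then difference it back.
    def x1_hat(k: int) -> float:
        return (history[0] - b / a) * math.exp(-a * k) + b / a

    return x1_hat(n) - x1_hat(n - 1)


def is_hot(history: Sequence[float], threshold: float) -> bool:
    """Hypothetical rule: a replica is hot if its forecast reaches the threshold."""
    return predict_next_heat(history) >= threshold
```

Under these assumptions, a replica whose access counts grow from period to period (for example `[10, 12, 14.4, 17.28]`) is forecast to keep growing and would be a candidate for moving to a RAM disk or SSD, while a replica with a flat, low history would stay on the HDD.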

Keywords/Search Tags:

HDFS, Distributed storage, Hybrid storage, Replicas, ARC algorithm

Related items

1	The Technical Research Of Optimization Of File Storage In HDFS
2	Research On Storage Strategy Of Distributed File System HDFS
3	Research And Optimization Of Distributed Storage Based On HDFS
4	Research And Optimization Of The Hybrid Distributed Storage System
5	Research And Implementation On The Distributed Storage System Based On HDFS
6	Research And Implementation Of Storage Policy Of Hybrid Distributed Storage System
7	Research And Implementation Of Distributed Storage Based On HDFS
8	Research On Key Technology Of Cloud Storage Based On Hdfs
9	Research And Implementation Of Small File Storage Model Based On HDFS
10	Research And Optimization Of The Distributed Storage On HDFS