| With the rapid development of Internet technologies and applications, the amount of data in networks has exploded, and large-scale data storage has become a research hotspot. HDFS (Hadoop Distributed File System) is now widely used in big data applications because of its efficient and stable storage. However, HDFS still has several shortcomings and needs continuous improvement. First, HDFS does not consider the storage heterogeneity of datanodes, and its underlying storage supports only a single storage medium, so HDFS clusters cannot make good use of fast storage devices (e.g., RAM disks and solid-state drives) to improve I/O performance and throughput. Second, HDFS cannot identify hot data, so the storage nodes that hold hot data become a performance bottleneck for the cluster. Finally, when a cluster holds a large number of hot small files, HDFS has no effective replacement policy for caching them; accessing these files therefore requires constant interaction with the HDD, which greatly reduces small-file access efficiency. To mitigate these problems, this paper studies and improves the replica placement and caching mechanisms of HDFS. The main work of this paper is as follows. (1) To address the problem that HDFS cannot effectively use RAM disks and SSDs, this paper proposes a Heat Perception based Adaptive Move Policy of replicas for hybrid HDFS (HPAMP), which exploits the storage heterogeneity of datanodes. Specifically, HPAMP places replicas on fast storage (RAM disks or SSDs) based on file size. When the cluster is idle, HPAMP uses the gray prediction algorithm to predict the heat of replicas, and then moves hot replicas from the HDD to a RAM disk or SSD and moves cold replicas from the RAM disk or SSD to the HDD. In addition, the number of replicas moved is adjusted adaptively according to space utilization. The experimental results show that, in the TeraGen benchmark, HPAMP is 3.4 times, 1.89 times and 1.68 times faster than the Default Policy of HDFS (DP), the Round Robin Policy (RRP) and the Tier Aware Policy (TAP), respectively; in the Sort benchmark, HPAMP reduces run time by 46.2%, 29.2% and 21.3% compared with DP, RRP and TAP, respectively.
(2) To address the large amount of hot data among the massive small files stored in HDFS, this paper proposes an HDFS small-file cache management method based on the ARC replacement algorithm. The method treats hot files as the candidates for caching. The ARC algorithm accurately predicts the heat of small files that will be accessed frequently, adds them to the cache, and dynamically replaces the data in the cache. This cache management greatly reduces how often disks are accessed when hot small files are read from HDFS and improves the access efficiency of the cluster. The experimental results show that, compared with FIFO, LRU and LFU, the ARC algorithm achieves the highest cache hit rate; in terms of data access efficiency, it reduces run time by 14.2%, 6.1% and 3%, respectively. |
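The heat prediction step in contribution (1) can be illustrated with the standard GM(1,1) gray model. This is only a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say how heat is measured, so the sketch assumes it is the number of accesses to a replica per time window, and the names `gm11_predict_next`, `is_hot` and the `threshold` parameter are hypothetical.

```python
import math


def gm11_predict_next(history):
    """Predict the next value of a short non-negative series with GM(1,1).

    `history` is assumed to hold per-window access counts for one replica
    (an assumption; the source does not define the heat metric).
    """
    n = len(history)
    if n < 3:
        raise ValueError("GM(1,1) needs at least three observations")
    x0 = [float(v) for v in history]

    # Accumulated generating sequence x1[k] = x0[0] + ... + x0[k].
    x1, running = [], 0.0
    for v in x0:
        running += v
        x1.append(running)

    # Background values z[k] = mean of consecutive accumulated values.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = x0[1:]
    m = n - 1

    # Least-squares fit of x0[k] = -a * z[k] + b (a 2x2 normal system).
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zi * yi for zi, yi in zip(z, y))
    denom = m * szz - sz * sz
    if abs(denom) < 1e-12:
        return x0[-1]  # degenerate series (e.g. all zeros): repeat last value
    slope = (m * szy - sz * sy) / denom
    a = -slope
    b = (sy - slope * sz) / m
    if abs(a) < 1e-12:
        return max(0.0, b)  # no growth or decay: the series is flat

    # Time response x1_hat(k) = (x0[0] - b/a) * exp(-a*k) + b/a,
    # then difference it to return to the original series.
    c = x0[0] - b / a

    def x1_hat(k):
        return c * math.exp(-a * k) + b / a

    return max(0.0, x1_hat(n) - x1_hat(n - 1))


def is_hot(history, threshold):
    """Hypothetical rule: a replica is hot if its predicted heat reaches `threshold`."""
    return gm11_predict_next(history) >= threshold
```

Under these assumptions, a mover running while the cluster is idle could promote replicas for which `is_hot` returns true and demote the rest; how many replicas to move per round would depend on space utilization, which this sketch does not model.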
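The cache replacement idea in contribution (2) can be sketched with a compact version of the ARC (Adaptive Replacement Cache) algorithm. This is an illustrative sketch of the generic algorithm, not the paper's HDFS integration: the class name `ARCCache`, the `load` callback (standing in for reading a small file from disk) and the use of file names as keys are assumptions made for the example.

```python
from collections import OrderedDict


class ARCCache:
    """Minimal ARC sketch: T1/T2 hold cached entries, B1/B2 are ghost lists."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.c = capacity
        self.p = 0.0               # adaptive target size of T1
        self.t1 = OrderedDict()    # seen once recently (key -> value)
        self.t2 = OrderedDict()    # seen at least twice (key -> value)
        self.b1 = OrderedDict()    # keys recently evicted from T1
        self.b2 = OrderedDict()    # keys recently evicted from T2

    def _replace(self, key):
        # Evict the LRU entry of T1 or T2, remembering its key in a ghost list.
        if self.t1 and (len(self.t1) > self.p
                        or (key in self.b2 and len(self.t1) == self.p)):
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        elif self.t2:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def get(self, key, load):
        """Return (value, hit). `load(key)` is called only on a miss."""
        if key in self.t1:                      # second access: promote to T2
            value = self.t1.pop(key)
            self.t2[key] = value
            return value, True
        if key in self.t2:                      # repeated access: refresh recency
            self.t2.move_to_end(key)
            return self.t2[key], True

        value = load(key)
        if key in self.b1:                      # ghost hit: favor recency
            self.p = min(self.c, self.p + max(len(self.b2) / len(self.b1), 1))
            self._replace(key)
            del self.b1[key]
            self.t2[key] = value
        elif key in self.b2:                    # ghost hit: favor frequency
            self.p = max(0.0, self.p - max(len(self.b1) / len(self.b2), 1))
            self._replace(key)
            del self.b2[key]
            self.t2[key] = value
        else:                                   # key never seen (or forgotten)
            l1 = len(self.t1) + len(self.b1)
            total = l1 + len(self.t2) + len(self.b2)
            if l1 == self.c:
                if len(self.t1) < self.c:
                    self.b1.popitem(last=False)
                    self._replace(key)
                else:
                    self.t1.popitem(last=False)
            elif total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(key)
            self.t1[key] = value
        return value, False
```

The design point this illustrates is scan resistance: a file that has been read twice sits in T2, so a burst of one-off reads cycles through T1 without evicting it, which is the behavior that distinguishes ARC from plain FIFO or LRU for hot small files.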