Research On High Utilization Rate And Strong Scalability Of HDFS Storage

Posted on: 2020-02-17    Degree: Master    Type: Thesis
Country: China    Candidate: X Zhang    Full Text: PDF
GTID: 2428330590463879    Subject: Computer Science and Technology
Abstract/Summary:
With its high fault tolerance and reliability, HDFS has become the most widely used distributed file system in the field of big data storage. However, as the big data era develops, data volume grows explosively, which requires HDFS to provide higher storage utilization and stronger scalability. Based on these requirements and an in-depth analysis of HDFS, this paper identifies the following three issues:

(1) HDFS achieves data redundancy through a three-replica strategy, which guarantees high reliability of file data. However, the additional replicas are rarely accessed during normal operation, yet they increase storage space and other resource overhead by 200%, so storage space utilization is low.

(2) When HDFS stores a large number of small files, it generates a large amount of metadata and increases the memory consumption and load of the Namenode, which degrades HDFS storage performance.

(3) Metadata in HDFS is stored in two files, FSImage and EditLog, and is managed by loading them into Namenode memory. This file-based metadata management strategy makes the Namenode the bottleneck of HDFS scalability.

To improve the storage space utilization and scalability of HDFS, this paper designs L-HDFS, a highly scalable distributed file system based on HDFS, to solve the above three problems. The research contents and achievements mainly include:

(1) A localized erasure code, CLRC, based on RS codes is proposed to provide HDFS data redundancy. Compared with the multi-replica strategy, it significantly improves storage space utilization. At the same time, the RS code is improved by adding local check blocks, which reduces the number of blocks that must be read for data recovery. Experimental results show that, compared with the plain RS code, CLRC saves bandwidth and I/O during data recovery, has shorter decoding time, and achieves higher data recovery efficiency.

(2) A small-file merge and storage optimization algorithm, FEMA, is proposed. Namenode memory consumption is reduced by merging small files into large files. The index from small files to blocks is built on a logical file name generated by encoding the file ID and block ID, and a caching and prefetching mechanism is introduced to improve small-file access efficiency. Experimental results show that FEMA effectively reduces Namenode memory consumption and provides higher random read performance.

(3) A new metadata management scheme, MBR, based on an RDBMS is proposed to improve HDFS scalability. In the first stage, the process of writing metadata to the RDBMS is designed and implemented. In the second stage, the original HDFS metadata files are abandoned and the read path on the RDBMS is developed, so that HDFS can operate entirely on the newly built integrated metadata base. Experimental results show that the memory consumption of the L-HDFS Namenode does not grow with the number of files or directories, so the cluster can be scaled out further, even allowing distributed deployment across clusters.
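To make the locality idea behind CLRC concrete, the sketch below shows how a local check block lets a single lost block be rebuilt from its own group instead of the full stripe, which is where the bandwidth and I/O savings come from. This is a minimal illustration only, assuming XOR local parities and omitting the global RS parities; the function names are hypothetical and not taken from the thesis.

```python
# Illustrative sketch of local repair with per-group check blocks (not the thesis's CLRC code).
# Assumption: all blocks are equal-length byte strings; local parities are plain XOR.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks into one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def encode_local_parities(data_blocks, group_size):
    """Split data blocks into groups and attach one XOR local check block per group."""
    groups = [data_blocks[i:i + group_size]
              for i in range(0, len(data_blocks), group_size)]
    return [(group, xor_blocks(group)) for group in groups]

def repair_single_block(group, local_parity, lost_index):
    """Rebuild one lost block by reading only its own group plus the local check block."""
    survivors = [blk for i, blk in enumerate(group) if i != lost_index]
    return xor_blocks(survivors + [local_parity])

if __name__ == "__main__":
    data = [bytes([i]) * 8 for i in range(6)]          # 6 toy data blocks
    encoded = encode_local_parities(data, group_size=3)
    group, parity = encoded[0]
    rebuilt = repair_single_block(group, parity, lost_index=1)
    assert rebuilt == group[1]                          # recovered by reading 3 blocks, not 6
```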
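The FEMA index maps small files to blocks through a logical file name derived from the file ID and block ID, backed by caching and prefetching. The sketch below shows one plausible encoding and a tiny LRU cache standing in for that mechanism; the field widths, helper names, and cache policy are assumptions for illustration, not the thesis's exact design.

```python
# Illustrative sketch of a FEMA-style small-file index key and cache (assumed design).
from collections import OrderedDict

def logical_name(file_id: int, block_id: int) -> str:
    """Encode file ID and block ID into one logical file name used as the index key."""
    return f"{block_id:08x}-{file_id:08x}"

def decode_logical_name(name: str):
    """Recover (file_id, block_id) from a logical file name."""
    block_hex, file_hex = name.split("-")
    return int(file_hex, 16), int(block_hex, 16)

class PrefetchCache:
    """Tiny LRU cache standing in for caching/prefetching of hot small files."""
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, name):
        if name in self.entries:
            self.entries.move_to_end(name)      # mark as recently used
            return self.entries[name]
        return None

    def put(self, name, data):
        self.entries[name] = data
        self.entries.move_to_end(name)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used entry

if __name__ == "__main__":
    key = logical_name(file_id=42, block_id=7)
    assert decode_logical_name(key) == (42, 7)
    cache = PrefetchCache()
    cache.put(key, b"small file payload")
    assert cache.get(key) == b"small file payload"
```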
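The MBR scheme replaces the FSImage/EditLog files with metadata kept in an RDBMS. As a rough illustration of what such an integrated metadata base might look like, the sketch below stores inodes and block mappings in SQLite and resolves a path to its blocks with a join; the schema, column names, and choice of SQLite are assumptions, not the thesis's actual implementation.

```python
# Illustrative sketch of RDBMS-backed namespace metadata (assumed schema, SQLite for brevity).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inodes (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER,
    name      TEXT NOT NULL,
    is_dir    INTEGER NOT NULL,
    size      INTEGER DEFAULT 0
);
CREATE TABLE blocks (
    block_id  INTEGER PRIMARY KEY,
    inode_id  INTEGER REFERENCES inodes(id),
    seq       INTEGER NOT NULL
);
""")

# Create /data/file.txt with one block, then resolve the file's blocks by path components.
conn.execute("INSERT INTO inodes VALUES (1, NULL, '/', 1, 0)")
conn.execute("INSERT INTO inodes VALUES (2, 1, 'data', 1, 0)")
conn.execute("INSERT INTO inodes VALUES (3, 2, 'file.txt', 0, 4096)")
conn.execute("INSERT INTO blocks VALUES (1001, 3, 0)")

row = conn.execute(
    "SELECT b.block_id FROM inodes f "
    "JOIN inodes d ON f.parent_id = d.id "
    "JOIN blocks b ON b.inode_id = f.id "
    "WHERE d.name = 'data' AND f.name = 'file.txt'").fetchone()
print(row)  # (1001,)
```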
Keywords/Search Tags:HDFS, erasure code, small file storage, RDBMS, metadata management