| The Hadoop distributed storage system adopts a scalable architecture: multiple storage servers share the storage load, and a location server tracks where data is stored. This design improves the reliability, availability, and access efficiency of the cluster and makes it easy to expand. HDFS is the core distributed file system of Hadoop. However, the NameNode, which is the processing center of HDFS, can become the bottleneck of the cluster when it must handle massive numbers of small files, leading to low access efficiency and memory pressure on the central node. This thesis focuses on improving small-file access in a Hadoop-based distributed storage system. The main work is as follows: (1) Small-file storage performance optimization: This thesis proposes a file association analysis strategy based on the time continuity of small files and uses it to preprocess files before they are merged. The preprocessed small files are then merged into heap files by the small-file merging algorithm proposed in this thesis, which is based on the worst matching strategy, thereby significantly reducing the number of small files in the system. Experimental comparison shows that the algorithm effectively alleviates the memory load of the NameNode and improves file storage efficiency. (2) Small-file reading performance optimization: In contrast to existing approaches, an index module creates an index for the merged heap files during the merging process, and a Trie-based index search mechanism provides fast lookup and positioning of small files. In addition, this thesis establishes a file hotness model by exploring the access characteristics of files and proposes CRSH, a cache replacement strategy based on file hotness. CRSH dynamically replaces the files held in the cache space to improve cache utilization, which not only optimizes the use of NameNode memory but also improves the efficiency of file access. Based on the small-file access optimization strategies proposed in (1) and (2), this thesis designs and implements EHDFS, an extended system based on Hadoop. The system was developed in support of the research projects undertaken during the master's program. It effectively relieves the memory pressure on the NameNode, improves the file storage and upload rate, provides strong support for solving the storage and reading problems of massive small files, and gives users a convenient file management experience. It can solve problems in real application scenarios and has practical application value. |
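
The merge-then-index idea summarized above can be illustrated with a minimal sketch. It assumes that the "worst matching strategy" corresponds to classic worst-fit packing (each small file goes into the heap file with the most remaining space) and that the Trie index maps a file name to a (heap file, offset, length) triple. The abstract does not specify the actual algorithms, so every name here (`HeapFile`, `worst_fit_merge`, `TrieIndex`, `build_index`) and the fixed heap-file capacity are hypothetical, and the association-analysis preprocessing and the CRSH cache are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class HeapFile:
    """A merged container file (hypothetical structure, not from the thesis)."""
    capacity: int
    used: int = 0
    entries: list = field(default_factory=list)  # (name, offset, length)

    @property
    def free(self) -> int:
        return self.capacity - self.used


def worst_fit_merge(files, capacity):
    """Pack (name, size) pairs into heap files using worst fit.

    Each file goes into the existing heap file with the MOST free space;
    a new heap file is opened only when that one cannot hold the file.
    """
    heaps = []
    for name, size in files:
        if size > capacity:
            raise ValueError(f"{name} is larger than one heap file")
        target = max(heaps, key=lambda h: h.free, default=None)
        if target is None or target.free < size:
            target = HeapFile(capacity)
            heaps.append(target)
        # Record where the small file sits inside the heap file.
        target.entries.append((name, target.used, size))
        target.used += size
    return heaps


class TrieIndex:
    """Character trie mapping a small-file name to its location."""

    def __init__(self):
        self.root = {}

    def insert(self, name, location):
        node = self.root
        for ch in name:
            node = node.setdefault(ch, {})
        node[""] = location  # the empty key marks the end of a stored name

    def lookup(self, name):
        node = self.root
        for ch in name:
            node = node.get(ch)
            if node is None:
                return None
        return node.get("")


def build_index(heaps):
    """Index every merged file as (heap_id, offset, length)."""
    index = TrieIndex()
    for heap_id, heap in enumerate(heaps):
        for name, offset, length in heap.entries:
            index.insert(name, (heap_id, offset, length))
    return index


small_files = [("a.log", 40), ("b.log", 30), ("c.log", 50), ("d.log", 20)]
heaps = worst_fit_merge(small_files, capacity=100)
index = build_index(heaps)
print(len(heaps))             # 2
print(index.lookup("d.log"))  # (1, 50, 20)
```

In this toy run, four small files collapse into two heap files, so a NameNode would track two objects instead of four, and a lookup costs one trie walk proportional to the length of the file name. Worst fit tends to keep the fill levels of the heap files balanced; whether the thesis uses exactly this placement rule is an assumption.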