| With the development of cloud computing and big data, the demand for distributed systems keeps growing, and the environments in which they are deployed are becoming increasingly complex. Hadoop is one of the mainstream distributed systems and has been widely applied in both scientific research and industry. Its simple, practical design and open-source support have made it one of the most prominent big data processing platforms. However, the Hadoop distributed file system (HDFS) handles massive numbers of small files poorly: a large number of small files puts excessive pressure on the management node, which can paralyze the system. Because complex big data environments require it in practice, a distributed file system should support the storage and access not only of large files but also of massive numbers of small files. To provide storage management for massive small files together with a high-performance cache and prefetch mechanism, this dissertation improves on the Hadoop file system. The research is divided into two parts: the aggregation of small files and the construction of an index, and the caching and prefetching of system metadata and data. For aggregation and indexing, this dissertation aggregates small files into a small number of aggregated files to reduce memory consumption on the Hadoop management node. During aggregation, the concept of a logical file name is defined to better represent the dependencies between small files. 
In addition, an improved cardinality ordering under a custom constraint makes the small files within each aggregated file locally relevant. Following the principle of Hadoop archive technology, an efficient indexing mechanism for small files is designed, which allows the Hadoop system to better support the processing of massive small files. For the cache and prefetch mechanism for system metadata and data, and in line with the relevance established by the aggregation and index design, each distributed storage node is equipped with a dedicated metadata cache, and the client uses suitable cache management and prefetch management structures. The dissertation presents a byte-oriented dynamic data prefetching algorithm based on HDFS blocks. The cache and prefetch design is combined with the small file aggregation and index design, giving the whole system effective metadata caching and efficient data prefetching. The system is versatile and extensible, offers efficient access performance, and supports the storage of both large and small files. In this dissertation, an HDFS-based distributed file system is designed. Simulation experiments compare this system with the original HDFS and with HDFS using the archive function, demonstrating the effectiveness of the small file aggregation strategy and the efficiency of the corresponding cache and prefetch mechanism. |
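
The aggregation-and-index idea described above can be illustrated with a minimal in-memory sketch. It is not the dissertation's implementation: the function names `aggregate` and `read_small_file` are hypothetical, the aggregated file is modeled as a single byte string rather than an HDFS file, and a plain lexicographic sort on the logical file name stands in for the improved ordering, whose details are not given here.

```python
from typing import Dict, List, Tuple

# Index entry: logical file name -> (offset, length) inside the aggregated file.
Index = Dict[str, Tuple[int, int]]


def aggregate(small_files: List[Tuple[str, bytes]]) -> Tuple[bytes, Index]:
    """Merge many small files into one aggregated blob plus an index.

    Assumption: sorting by logical name is only a stand-in for the
    ordering used in the source, so that related files end up adjacent.
    """
    ordered = sorted(small_files, key=lambda item: item[0])
    blob = bytearray()
    index: Index = {}
    for name, data in ordered:
        # Record where this file starts and how long it is, then append it.
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


def read_small_file(blob: bytes, index: Index, name: str) -> bytes:
    """Recover one small file from the aggregated blob via the index."""
    offset, length = index[name]
    return blob[offset:offset + length]


if __name__ == "__main__":
    files = [("logs/b.txt", b"bbb"), ("logs/a.txt", b"aa"), ("img/x.png", b"x")]
    blob, index = aggregate(files)
    print(index)
    print(read_small_file(blob, index, "logs/a.txt"))
```

With this layout the management node would track one aggregated file instead of three small ones, and each small file is recovered with a single index lookup followed by a ranged read.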