
Research And Implementation Of Small File Storage In Mass Education Resources Based On Hadoop

Posted on: 2016-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: X R You
GTID: 2308330473955207
Subject: Information security
Abstract/Summary:
Education resources are learning resources available on the network in many forms, such as text, video, and audio. Text resources account for more than 80% of all learning resources. They are large in number, and each file is generally at the KB level and rarely reaches the MB level; they are therefore referred to here as education resource small files. In the Internet age, the scale of online education resources keeps growing and the processing load is huge, so traditional distributed file systems cannot meet the demand for processing massive numbers of education resource small files.

Hadoop is an open source distributed processing platform that provides a reliable, scalable, and efficient way to handle massive data. Its distributed file system, HDFS, stores data and performs excellently on large-scale data handling. However, HDFS is designed for large files and has shortcomings when handling massive numbers of small files. For instance, storing massive numbers of small files on HDFS quickly fills the memory of the NameNode, which can cause a memory bottleneck. Accessing small files frequently requires jumping among several DataNodes, which slows access. Compared with large-file processing, small-file processing is too slow.

To solve the storage problem of massive education resource small files on the Hadoop platform, this thesis proposes a storage optimization scheme for small files consisting of four parts:

1) Classification of associated small files: the size of each file is checked before it is uploaded to the HDFS cluster. If it is a small file, it is classified with a classification algorithm, and the small files within a category are then associated using a hierarchical clustering algorithm, producing groups of associated small files.
2) Merging of small files: the classified, associated small files are merged into a large file, which is uploaded to the HDFS cluster. Merging reduces the amount of small-file metadata held in NameNode memory.
3) Index construction: an index is established for each merged large file, so that a small file can be retrieved quickly through the index file, improving retrieval speed.
4) Metadata caching and prefetching of associated small files: after a file is first read, its metadata and its associated small files are cached on the client. This mechanism improves the read efficiency of small files.

Finally, extensive experiments are carried out to test the storage optimization scheme. Three groups of experiments compare file writing time, small-file access time, and system memory usage. The experimental results show that the scheme reduces the NameNode memory consumed by large numbers of small files, improves the random access efficiency of small files, saves system resources, and reduces the time needed to read and write small files.
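
The merging and indexing steps (parts 2 and 3 of the scheme) can be illustrated with a minimal sketch. This is not the thesis's implementation: it works on in-memory bytes rather than HDFS, and the names `SMALL_FILE_LIMIT`, `merge_small_files`, and `read_small_file`, as well as the 1 MB threshold, are assumptions chosen for illustration. The idea shown is that many small files become one large object plus an index of (offset, length) entries, so a single small file can be read back without scanning the merged file.

```python
import io

# Assumed threshold: files below 1 MB are treated as small files.
SMALL_FILE_LIMIT = 1024 * 1024


def merge_small_files(files):
    """Concatenate small files into one blob and build an index.

    files: dict mapping file name -> file content (bytes).
    Returns (merged_bytes, index), where index maps
    file name -> (offset, length) inside merged_bytes.
    """
    merged = io.BytesIO()
    index = {}
    for name, data in files.items():
        if len(data) >= SMALL_FILE_LIMIT:
            continue  # large files would be stored on their own
        index[name] = (merged.tell(), len(data))
        merged.write(data)
    return merged.getvalue(), index


def read_small_file(merged, index, name):
    """Retrieve one small file from the merged blob using the index."""
    offset, length = index[name]
    return merged[offset:offset + length]


files = {"lesson1.txt": b"algebra notes", "lesson2.txt": b"geometry notes"}
merged, index = merge_small_files(files)
print(index["lesson2.txt"])                           # (13, 14)
print(read_small_file(merged, index, "lesson2.txt"))  # b'geometry notes'
```

In a real deployment the merged blob would be a single file in HDFS, so the NameNode would track one file's metadata instead of one entry per small file, and the index would be stored alongside it.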
Keywords/Search Tags: HDFS, education resource small files, merging, indexing, cache