Research And Implementation Of Small File Storage In Mass Education Resources Based On Hadoop

Posted on:2016-11-23

Degree:Master

Type:Thesis

Country:China

Candidate:X R You

Full Text:PDF

GTID:2308330473955207

Subject:Information security

Abstract/Summary:


Education resources are learning resources available on the network in many forms, such as text, video, and audio. Text resources account for more than 80% of all learning resources. They are numerous, and each file is typically only kilobytes in size and rarely reaches the megabyte level, so they are referred to here as education resource small files. In the Internet age, online education resources keep growing in scale and the processing load is enormous, so traditional distributed file systems can't meet the demand of processing massive numbers of these small files.

Hadoop is an open-source distributed processing platform that provides a reliable, scalable, and efficient way to handle massive data. Its distributed file system, HDFS, provides data storage and performs very well on large-scale data. However, HDFS is designed for large files, so it has shortcomings when handling massive numbers of small files. For instance, storing massive numbers of small files on HDFS quickly fills the NameNode's memory, which can become a bottleneck. Frequent access to small files requires jumping among several DataNodes, which slows access. Compared with large files, small files are processed too slowly.

To solve the storage problem of massive education resource small files on the Hadoop platform, this thesis proposes a storage optimization scheme for small files, which consists of the following four parts:

1) Classification of associated small files: the size of each file is checked before it is uploaded to the HDFS cluster. If it is a small file, it is classified with a classification algorithm, and the small files within a category are then associated with a hierarchical clustering algorithm, producing groups of associated small files.
2) Merging of small files: the classified, associated small files are merged into one large file, which is uploaded to the HDFS cluster. Merging reduces the amount of small-file metadata that occupies NameNode memory.
3) Indexing: an index is built for each merged large file, so a small file can be located quickly through the index file, which improves retrieval speed.
4) Metadata caching and prefetching of associated small files: after a file is first read, its metadata and its associated small files are cached on the client. This mechanism improves the read efficiency of small files.

Finally, extensive experiments were carried out to test the storage optimization scheme. Three groups of experiments compare file writing time, small-file access time, and system memory usage, respectively. The experimental results show that the scheme keeps large numbers of small files from rapidly consuming NameNode memory, improves the random-access efficiency of small files, saves system resources, and reduces the time needed to read and write small files.
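The merge, index, and cache-with-prefetch ideas described in the abstract can be illustrated with a minimal in-memory sketch. It is not the thesis's implementation: it uses plain Python byte strings instead of HDFS, and the names `MergedFile`, `merge_small_files`, and `CachingReader` are hypothetical. It also assumes that every small file merged into the same large file belongs to one associated group, so a first read prefetches the whole group.

```python
from dataclasses import dataclass, field


@dataclass
class MergedFile:
    """One merged 'large file' plus its index: name -> (offset, length)."""
    data: bytes = b""
    index: dict = field(default_factory=dict)


def merge_small_files(files: dict) -> MergedFile:
    """Concatenate small files and record where each one sits in the result."""
    merged = bytearray()
    index = {}
    for name, content in files.items():
        index[name] = (len(merged), len(content))
        merged.extend(content)
    return MergedFile(bytes(merged), index)


class CachingReader:
    """Client-side reader: index lookup plus prefetch of the associated group."""

    def __init__(self, merged: MergedFile):
        self.merged = merged
        self.cache = {}
        self.fetches = 0  # number of reads that had to go to the merged file

    def _read_from_merged(self, name: str) -> bytes:
        # The index turns a file name into a direct slice of the merged data.
        offset, length = self.merged.index[name]
        return self.merged.data[offset:offset + length]

    def read(self, name: str) -> bytes:
        if name in self.cache:
            return self.cache[name]
        # Cache miss: fetch once and prefetch every file in the same group
        # (assumption: one merged file holds exactly one associated group).
        self.fetches += 1
        for other in self.merged.index:
            self.cache[other] = self._read_from_merged(other)
        return self.cache[name]
```

In this sketch, reading any file from a group costs one fetch, and later reads of its associated files are served from the client cache. The NameNode-side benefit corresponds to tracking one merged file rather than one entry per small file.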

Keywords/Search Tags:

HDFS, education resource small files, merged, indexing, cache


Related items

1	Research On The Optimization Of Small Files Processing And Replication Strategy Based On HDFS
2	Research And Optimization Of Mass Small Files Based On HDFS
3	Optimization Of Small Files Accessed Base On MapFile In HDFS
4	Research And Implementation Of Mass Small File Storage System Based On HDFS
5	Research On Efficient Storage Of Small Files In Mobile Ultrasound Detection Based On HDFS
6	Research And Implementation Of Small File Storage Model Based On HDFS
7	Research And Design Of Massive Small Files Merging Based On Hadoop
8	Research On Access Optimization Of Small Files In Hadoop Cluster
9	Research On HDFS Small File Archiving Method
10	Processing Of Small Files Based On HDFS And Optimization And Improvement Of The Performance For Mapreduce Computing Model