Research On Storage Optimization Technology Based On HDFS In Cloud Environment

Posted on: 2020-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: F Z Chen
Full Text: PDF
GTID: 2428330590996026
Subject: Software engineering
Abstract/Summary:
Against the background of big data, the value of data is increasingly prominent. As a model for mass data storage, cloud storage has become a hot research topic. HDFS (Hadoop Distributed File System) clusters based on Docker containers have attracted the attention of many researchers because of their high data throughput, rapid cluster deployment, and ability to run on inexpensive devices. However, such clusters have data storage reliability issues, so the data persistence technology and the data replica placement algorithm need to be optimized. Moreover, although block-level backup in an HDFS cluster ensures the security of stored data to a certain extent, an HDFS cluster cannot effectively provide flexible storage backup for the various types of data in a cloud environment. The storage requirements of different types of data in the cloud environment need to be adjusted correspondingly, so the data partitioning algorithm and the backup strategy need to be optimized as well. This thesis focuses on storage optimization technology for HDFS in the cloud environment and covers the following three aspects.

Firstly, to address the reliability of data storage on HDFS clusters based on Docker containers, a data persistence technology is proposed that uses data volumes and data volume containers to realize data sharing and data persistence across the containerized HDFS cluster. The persistent data includes the various types of data stored by the cluster and the metadata of each Hadoop cluster node. In addition, a data replica placement algorithm based on HDFS is proposed. When placing backup copies of a data block, the algorithm jointly considers the performance of the host machine and of the container node, which improves the reliability of cluster data storage and reduces the difference in available storage space between nodes. The experimental results show that the data persistence technology and the data replica placement algorithm can effectively migrate cluster data, improve the I/O performance of the cluster, and greatly enhance the reliability of data storage.

Secondly, to address the single backup strategy of HDFS clusters, a storage architecture based on Federation HDFS is used instead of the traditional HDFS cluster. Data partitioned by the data partitioning algorithm is stored in this architecture under different storage strategies. In addition, a data partitioning algorithm suitable for big data environments is proposed. The algorithm assigns the values of data features and distances by means of quadratic weights to maintain efficiency and improve the accuracy of data partitioning. The experimental results show that the algorithm effectively improves the accuracy and efficiency of data partitioning, and that the storage architecture based on Federation HDFS reduces wasted storage space and achieves effective data storage while implementing flexible storage backup.

Finally, to solve the storage problems described above, a prototype system is designed and implemented. It is described from four aspects: data storage reliability, data storage memory, data I/O access, and data backup. The system test results demonstrate that, first, the Docker-based HDFS cluster data persistence technology and the data replica placement algorithm ensure persistent data storage and improve data I/O performance; and second, the KNN-based data partitioning algorithm and the Federation HDFS cluster architecture effectively ensure flexible storage backup of data and improve storage space utilization.
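
The abstract does not give the details of the replica placement algorithm, so the following is only a minimal sketch of the general idea it describes: rank candidate DataNode containers by a combined score of host performance, container performance, and free storage, then pick replica targets from the top of that ranking. The `Node` fields, the weights, and the preference for distinct hosts are all assumptions made for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A containerized DataNode (all fields are hypothetical, normalized to 0..1)."""
    name: str
    host: str              # physical host running the container
    host_perf: float       # assumed host-machine performance score
    container_perf: float  # assumed container-node performance score
    free_fraction: float   # free storage divided by total storage


def score(node, w_host=0.3, w_container=0.3, w_space=0.4):
    """Weighted sum of the three factors; the weights are illustrative only."""
    return (w_host * node.host_perf
            + w_container * node.container_perf
            + w_space * node.free_fraction)


def place_replicas(nodes, replicas=3):
    """Pick up to `replicas` target nodes, best score first.

    The first pass prefers nodes on distinct hosts (an assumption, so that one
    host failure does not remove every copy); the second pass fills any
    remaining slots from the rest of the ranking.
    """
    ranked = sorted(nodes, key=score, reverse=True)
    chosen, used_hosts = [], set()
    for node in ranked:
        if len(chosen) == replicas:
            break
        if node.host not in used_hosts:
            chosen.append(node)
            used_hosts.add(node.host)
    for node in ranked:
        if len(chosen) == replicas:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen
```

Giving free storage a large weight steers new blocks toward emptier nodes, which is one simple way to narrow the gap in available space between nodes that the abstract mentions.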
Keywords/Search Tags: Cloud Storage, HDFS, Docker Container, Data Persistence Technology, Data Partitioning