Research On Storage Optimization Based On The Distributed File System HDFS

Posted on: 2018-09-10, Degree: Master, Type: Thesis
Country: China, Candidate: X He, Full Text: PDF
GTID: 2438330518955133, Subject: Computer technology
Abstract/Summary:
The exponential growth of data has aggravated the high cost and excessive energy consumption of data storage. According to relevant statistics, because of large amounts of redundant data and unreasonable storage structures, the effective storage utilization of large data centers is below 40%. To address the low utilization and unreasonable storage structure encountered in big-data storage, existing solutions improve storage utilization by eliminating redundant data and optimizing the storage structure. Most methods for detecting duplicate data use a hash function to generate a hash value for each data block and compare the hash values to decide whether two blocks are identical. However, hash functions are subject to collisions, so different blocks can have the same hash value. For storage structure optimization, the existing DBSCAN algorithm relies on thresholds set from experience, so the quality of its clustering results cannot be guaranteed, and it is inefficient when processing large amounts of data. Building on the existing solutions, this thesis proposes a de-duplication model based on "CubeHash + keyword + feature vector" labels and a structure optimization model, GA-DBSCANMR, based on a genetic algorithm and the MapReduce programming framework. (1) The de-duplication model identifies identical and similar data and deletes the redundant data in the storage system. (2) The GA-DBSCANMR model clusters the training sample data set and turns the index information table of the de-duplication model into a clustering index table, so that similar blocks are grouped accurately, the time spent on addressing and comparison is reduced, storage efficiency improves, and the storage structure is optimized. The experimental results show that the de-duplication model effectively improves the utilization of storage space, that the GA-DBSCANMR model reduces de-duplication time, and that the optimization effect becomes more pronounced as the data volume increases.
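The collision problem described in the abstract can be illustrated with a minimal sketch. It is not the thesis's "CubeHash + keyword + feature vector" model: SHA-256 from Python's standard library stands in for CubeHash (which `hashlib` does not provide), and the function name `deduplicate` and the parameter `digest_fn` are hypothetical. The sketch only shows the general idea of using a hash as a fast filter and confirming a match by comparing block contents, so that two different blocks with the same hash value are not merged.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def sha256_digest(block: bytes) -> bytes:
    # Stand-in hash; the thesis uses CubeHash, which is not in hashlib.
    return hashlib.sha256(block).digest()


def deduplicate(
    blocks: List[bytes],
    digest_fn: Callable[[bytes], bytes] = sha256_digest,
) -> Tuple[List[bytes], List[int]]:
    """Return (unique_blocks, refs); refs[i] is the position in
    unique_blocks of the stored copy of blocks[i]."""
    index: Dict[bytes, List[int]] = {}  # digest -> positions in unique_blocks
    unique_blocks: List[bytes] = []
    refs: List[int] = []
    for block in blocks:
        candidates = index.setdefault(digest_fn(block), [])
        for pos in candidates:
            # Equal digests are only a hint; compare bytes to rule out a collision.
            if unique_blocks[pos] == block:
                refs.append(pos)
                break
        else:
            # No stored block with identical content: keep this one.
            unique_blocks.append(block)
            candidates.append(len(unique_blocks) - 1)
            refs.append(len(unique_blocks) - 1)
    return unique_blocks, refs
```

Passing a deliberately weak `digest_fn`, for example one that returns only the block length, forces collisions and shows that distinct blocks are still stored separately. The byte comparison costs extra reads, which is one reason a real system might prefer additional labels over a full content comparison.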
Keywords/Search Tags: De-duplication model, GA-DBSCANMR model, Storage space, HDFS