| With the rapid development of Internet technology, the era of big data has arrived, bringing more complex data types and larger data volumes. Big data brings diversified commercial value to enterprises, but it also poses major challenges to the storage industry. Cloud storage, built on distributed storage technology, has therefore captured much of the storage market thanks to its high efficiency and low cost. However, both current mainstream cloud storage systems and traditional data storage systems inevitably store a large amount of redundant data. In archive and backup systems in particular, redundant data can account for 70% of total storage, which greatly increases storage costs. To reduce hardware cost and energy consumption, storage space utilization must be improved through data deduplication. When traditional data deduplication is applied to cloud storage, however, two key problems arise: first, how to balance the deduplication rate against system resource utilization while improving overall system efficiency; second, how to effectively reduce the impact of deduplication on the cloud storage system. To address these problems, this thesis makes the following contributions. 1. Existing data deduplication systems pursue the deduplication rate excessively, which leads to low system efficiency and high overhead. To address this, a data deduplication mechanism based on data similarity clustering is proposed. The mechanism clusters data by similarity, and a secondary index structure is designed and built from the clustering results. To make data retrieval and comparison more efficient, a cache replacement algorithm is designed around the correlation between data, which further improves the index cache hit rate. The results show that, compared with other deduplication mechanisms, the proposed mechanism can greatly improve system efficiency while maintaining a high deduplication rate.
2. Under the Ceph distributed storage architecture, deduplication not only aggravates the existing load imbalance across OSDs but also degrades Ceph's read and write performance. In response, the mechanism from work 1 is further optimized with an OSD load rebalancing strategy, which takes system read and write performance into account when migrating data so that the overall cluster load is more balanced; at the same time, the deduplication read and write paths are optimized to reduce the impact on Ceph's read and write efficiency. Compared with the original Ceph distributed storage system, the experimental results show that the OSD load rebalancing strategy effectively resolves the Ceph load imbalance caused by deduplication and improves read and write efficiency. Overall, the research shows that the similarity-clustering-based deduplication mechanism and the OSD load rebalancing strategy not only improve deduplication efficiency and space utilization but also greatly reduce Ceph load imbalance, which gives the work good practical significance. |
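
The abstract does not give implementation details, so the following is only a minimal sketch of how a similarity-clustered, two-level deduplication index might look. Everything in it is an assumption made for illustration: the class `ClusteredDedupStore`, fixed-size chunking, the use of the smallest chunk fingerprint as the similarity feature, and a plain LRU cache standing in for the correlation-aware cache replacement algorithm the thesis describes.

```python
import hashlib
from collections import OrderedDict

CHUNK_SIZE = 4  # tiny fixed-size chunks, chosen only to keep the demo readable


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of one chunk (SHA-256 hex digest)."""
    return hashlib.sha256(chunk).hexdigest()


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Fixed-size chunking (an assumption; the thesis may chunk differently)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ClusteredDedupStore:
    """Hypothetical two-level index: similarity feature -> cluster -> chunks."""

    def __init__(self, cache_capacity: int = 2):
        self.primary = {}            # level 1: representative fingerprint -> cluster id
        self.clusters = {}           # level 2: cluster id -> {fingerprint: chunk}
        self.cache = OrderedDict()   # LRU cache of cluster indexes held "in memory"
        self.cache_capacity = cache_capacity
        self.cache_hits = 0
        self.cache_misses = 0
        self.stored_chunks = 0
        self.duplicate_chunks = 0

    def _load_cluster(self, cid: int) -> dict:
        """Fetch a cluster index through the LRU cache (stand-in policy)."""
        if cid in self.cache:
            self.cache.move_to_end(cid)
            self.cache_hits += 1
        else:
            self.cache_misses += 1
            self.cache[cid] = self.clusters[cid]
            if len(self.cache) > self.cache_capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[cid]

    def write(self, data: bytes) -> list:
        """Deduplicate one object; return its recipe of (cluster id, fingerprint)."""
        chunks = split_chunks(data)
        fps = [fingerprint(c) for c in chunks]
        if not fps:
            return []
        # Similarity feature: objects sharing their smallest fingerprint are
        # treated as similar and routed to the same cluster.
        rep = min(fps)
        cid = self.primary.get(rep)
        if cid is None:
            cid = len(self.clusters)
            self.clusters[cid] = {}
            self.primary[rep] = cid
        cluster = self._load_cluster(cid)
        recipe = []
        for fp, chunk in zip(fps, chunks):
            if fp in cluster:
                self.duplicate_chunks += 1   # only the chosen cluster is searched
            else:
                cluster[fp] = chunk
                self.stored_chunks += 1
            recipe.append((cid, fp))
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild an object from its recipe."""
        return b"".join(self.clusters[cid][fp] for cid, fp in recipe)
```

Because each write compares fingerprints only against one cluster instead of a global index, the sketch trades a little deduplication rate (duplicates that land in a different cluster are missed) for much cheaper lookups, which is the kind of balance between deduplication rate and system overhead that the first contribution targets.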