| Distributed storage systems have advantages over centralized systems for large-scale data storage: they scale efficiently, maintain high availability and performance, and cope with fast-growing, massive volumes of data. As a result, they are widely used in major data centers and Internet companies. As hardware and software continue to be upgraded, large-scale distributed storage clusters are increasingly likely to be heterogeneous, containing multiple types of storage media such as mechanical hard drives, solid-state drives, and NVMe SSDs, and performance optimization of such clusters faces many challenges. This paper therefore focuses on the performance optimization problems of heterogeneous, complex distributed storage systems, with the goal of improving the overall performance of distributed storage systems represented by Ceph, covering both balanced data distribution and efficient cache optimization strategies for heterogeneous storage pools. The contributions of this paper are summarized in the following two points: (1) When storage devices have heterogeneous capacities and storage pools are diverse, the hierarchical Cluster Map used by Ceph's data distribution algorithm can prevent larger-capacity devices from receiving a proportionate share of data, which in turn unbalances data storage across the other devices at the same level. In addition, the mgr balancer places PGs without considering a global view of the pools, which can also lead to uneven data distribution. For this reason, the paper proposes a data balancing method that is aware of both cluster topology and storage pools, aiming to optimize the balance and performance of the storage system. The method designs a Dynamic Global Weight Balancer (DGWBalancer), which integrates several factors such as cluster topology, cluster storage utilization, device storage utilization and PG allocation 
in storage pools to dynamically and globally adjust the PG distribution across the devices and storage pools in the cluster, thereby achieving a more reasonable and balanced data distribution. (2) For mixed HDD and SSD storage scenarios, Ceph Cache Tier provides a tiered caching feature that separates fast and slow storage pools for more efficient management of data objects. Because the total capacity of the cache pool is limited, it can store only some of the data objects, and higher performance is achieved when clients concentrate their accesses on hot objects in the cache pool. However, when a client access misses in the cache pool, redundant I/O operations are generated, which increases client access latency and reduces throughput. To improve the hit rate of the cache pool, the paper proposes a Temperature Density Cache replacement algorithm (TDC), which calculates the temperature density of the space consumed by each object and evicts the object with the lowest temperature density. Evicting the objects that contribute least to the hit rate improves the hit rate of the cache pool. The algorithm mainly comprises object temperature calculation, temperature density calculation and the cache replacement strategy. The TDC algorithm is evaluated by replaying I/O workloads from a real trace dataset to demonstrate its effectiveness in improving the cache hit rate. The TDC algorithm is also applied to the Ceph distributed storage system to verify the performance of a Cache Tier based on the TDC algorithm. |
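
The eviction idea behind TDC can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: temperature is modeled as an exponentially decayed access count, temperature density as temperature divided by object size, and the names `TemperatureDensityCache`, `CachedObject`, and `decay` are hypothetical. It is not the paper's implementation and does not use any Ceph API.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    size: int            # space the object occupies in the cache pool
    temperature: float   # ASSUMPTION: exponentially decayed access count
    last_access: float   # logical time of the most recent access


class TemperatureDensityCache:
    """Toy cache that evicts the object with the lowest temperature per unit size."""

    def __init__(self, capacity, decay=0.9):
        self.capacity = capacity
        self.decay = decay    # ASSUMPTION: cooling factor per unit of logical time
        self.used = 0
        self.objects = {}     # key -> CachedObject

    def _cooled(self, obj, now):
        # Temperature after cooling from the last access until `now`.
        return obj.temperature * self.decay ** (now - obj.last_access)

    def density(self, key, now):
        # ASSUMPTION: temperature density = current temperature / object size.
        obj = self.objects[key]
        return self._cooled(obj, now) / obj.size

    def access(self, key, size, now):
        """Return True on a hit; on a miss, admit the object and return False."""
        obj = self.objects.get(key)
        if obj is not None:
            obj.temperature = self._cooled(obj, now) + 1.0
            obj.last_access = now
            return True
        if size > self.capacity:
            return False      # object cannot fit in the cache at all
        while self.used + size > self.capacity:
            # Evict the object with the lowest temperature density.
            victim = min(self.objects, key=lambda k: self.density(k, now))
            self.used -= self.objects.pop(victim).size
        self.objects[key] = CachedObject(size, 1.0, now)
        self.used += size
        return False


cache = TemperatureDensityCache(capacity=100)
cache.access("a", 50, now=0)   # miss, admitted
cache.access("a", 50, now=1)   # hit, "a" heats up
cache.access("b", 50, now=2)   # miss, admitted; cache is now full
cache.access("c", 40, now=3)   # miss; the colder "b" is evicted, "a" stays
```

Under these assumptions, a large object that is accessed rarely has a low density and is evicted before a small, frequently accessed one, which is the intuition of evicting objects that contribute least to the hit rate per unit of cache space.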