With the development of modern information technology, the amount of data generated by various sources has grown exponentially. The ever-expanding data scale and business demands have made it impossible for traditional single-machine computing to analyze large datasets, and big data computing has consequently shifted from single-machine computing to distributed data processing. Spark is an in-memory distributed computing framework for big data. Its key data abstraction, the Resilient Distributed Dataset (RDD), brings significant performance improvements to big data computing. In practical application scenarios, Spark tasks often need to replace cached RDDs because of insufficient storage memory. By default, Spark uses the Least Recently Used (LRU) algorithm as its cache replacement strategy. This algorithm considers only the most recent use time of an RDD as the basis for replacement, so an RDD that still needs to be reused may be evicted from storage memory during cache replacement, causing Spark performance to degrade. Existing Spark cache replacement strategies do not consider replacement factors comprehensively and are not adjusted for different cluster conditions. To address these problems, this thesis proposes two optimization methods that improve the working performance of Spark and raise the cache hit rate of the cache replacement strategy. The main contributions of this thesis are as follows:

1) Aiming at the problem that a cache replacement algorithm behaves differently on different machines, this thesis designs a memory-aware cache replacement strategy. The strategy builds an RDD weight replacement model for Spark cache replacement and can adaptively select an appropriate replacement strategy for different RDD sizes and cluster memory ratios to optimize computing performance.

2) Aiming at varying degrees of cluster resource usage, this thesis designs a Spark cache replacement strategy based on dynamic scheduling. The strategy detects the usage of machine memory resources and feeds it back into the cluster cache replacement strategy in the form of dynamic scheduling parameters, achieving a finer-grained response to changes in machine performance.

Finally, this thesis designs corresponding simulation experiments to test and analyze the performance of the memory-aware Spark cache replacement strategy and the dynamic-scheduling-based Spark cache replacement strategy. The experimental results show that, compared with the LRU algorithm, the two proposed cache replacement strategies can effectively reduce the workload of Spark and improve its running efficiency in different scenarios.
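To make the contrast between LRU and a weight-based replacement policy concrete, the following is a minimal Python sketch. The `WeightedCache` weight formula (combining reuse frequency, recency, and RDD size) is purely illustrative and is not the thesis's actual weight model; the class names and the trace are likewise hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Classic LRU eviction: only recency of use matters."""
    def __init__(self, capacity):
        self.capacity = capacity          # total size budget
        self.entries = OrderedDict()      # rdd_id -> size, ordered by recency
        self.used = 0

    def access(self, rdd_id, size):
        if rdd_id in self.entries:
            self.entries.move_to_end(rdd_id)   # refresh recency on hit
            return True
        while self.used + size > self.capacity and self.entries:
            _, old = self.entries.popitem(last=False)  # evict least recent
            self.used -= old
        if size <= self.capacity:
            self.entries[rdd_id] = size
            self.used += size
        return False

class WeightedCache:
    """Weight-based eviction: evict the RDD with the lowest weight,
    where weight combines reuse frequency, recency, and size.
    The concrete formula below is an illustrative assumption."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.meta = {}    # rdd_id -> [size, reuse_count, last_access_tick]
        self.used = 0
        self.tick = 0

    def _weight(self, rdd_id):
        size, freq, last = self.meta[rdd_id]
        recency = 1.0 / (1 + self.tick - last)
        return freq * recency / size      # hypothetical weighting

    def access(self, rdd_id, size):
        self.tick += 1
        if rdd_id in self.meta:
            self.meta[rdd_id][1] += 1     # count the reuse
            self.meta[rdd_id][2] = self.tick
            return True
        while self.used + size > self.capacity and self.meta:
            victim = min(self.meta, key=self._weight)  # lowest weight goes
            self.used -= self.meta.pop(victim)[0]
        if size <= self.capacity:
            self.meta[rdd_id] = [size, 1, self.tick]
            self.used += size
        return False
```

On an access trace where one RDD is reused repeatedly while others appear once, the weighted policy keeps the frequently reused RDD that LRU would evict; for example, replaying `A, A, B, C, A` (each RDD of size 5, capacity 10) yields one hit under LRU but two under the weighted policy.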