With the development of modern information technology, the amount of data generated by various sources has grown exponentially. The ever-expanding data scale and business demands have made it impossible for traditional single-machine computing to analyze large datasets, and big data computing has consequently shifted from single-machine computing to distributed data processing. Spark is an in-memory distributed computing framework for big data. Its key data abstraction, the Resilient Distributed Dataset (RDD), brings significant performance improvements to big data computing. In practical application scenarios, Spark tasks often need to replace cached RDDs because of insufficient storage memory. By default, Spark uses the Least Recently Used (LRU) algorithm as its cache replacement strategy. This algorithm considers only the most recent use time of an RDD as the basis for replacement, so an RDD that still needs to be reused may be evicted from storage memory during cache replacement, causing Spark performance to degrade. Existing Spark cache replacement strategies do not consider replacement factors comprehensively and are not adjusted for different cluster conditions. To address these problems, this thesis proposes two optimization methods that improve the working performance of Spark and raise the cache hit rate of the cache replacement strategy. The main contributions of this thesis are as follows:

1) Aiming at the problem that a cache replacement algorithm behaves differently on different machines, this thesis designs a memory-aware cache replacement strategy. The strategy builds an RDD weight replacement model for Spark cache replacement and can adaptively select an appropriate replacement strategy for different RDD sizes and cluster memory ratios to optimize computing performance.

2) Aiming at varying degrees of cluster resource usage, this thesis designs a Spark cache replacement strategy based on dynamic scheduling. The strategy detects the usage of machine memory resources and feeds it back into the cluster cache replacement strategy in the form of dynamic scheduling parameters, achieving a finer-grained response to changes in machine performance.

Finally, this thesis designs corresponding simulation experiments to test and analyze the performance of the memory-aware Spark cache replacement strategy and the dynamic-scheduling-based Spark cache replacement strategy. The experimental results show that, compared with the LRU algorithm, the two proposed cache replacement strategies can effectively reduce the workload of Spark and improve its running efficiency in different scenarios.
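To make the contrast between LRU and a weight-based replacement policy concrete, the following is a minimal Python sketch. The `WeightedCache` weight formula (combining reuse frequency, recency, and RDD size) is purely illustrative and is not the thesis's actual weight model; the class names and the trace are likewise hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Classic LRU eviction: only recency of use matters."""
    def __init__(self, capacity):
        self.capacity = capacity          # total size budget
        self.entries = OrderedDict()      # rdd_id -> size, ordered by recency
        self.used = 0

    def access(self, rdd_id, size):
        if rdd_id in self.entries:
            self.entries.move_to_end(rdd_id)   # refresh recency on hit
            return True
        while self.used + size > self.capacity and self.entries:
            _, old = self.entries.popitem(last=False)  # evict least recent
            self.used -= old
        if size <= self.capacity:
            self.entries[rdd_id] = size
            self.used += size
        return False

class WeightedCache:
    """Weight-based eviction: evict the RDD with the lowest weight,
    where weight combines reuse frequency, recency, and size.
    The concrete formula below is an illustrative assumption."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.meta = {}    # rdd_id -> [size, reuse_count, last_access_tick]
        self.used = 0
        self.tick = 0

    def _weight(self, rdd_id):
        size, freq, last = self.meta[rdd_id]
        recency = 1.0 / (1 + self.tick - last)
        return freq * recency / size      # hypothetical weighting

    def access(self, rdd_id, size):
        self.tick += 1
        if rdd_id in self.meta:
            self.meta[rdd_id][1] += 1     # count the reuse
            self.meta[rdd_id][2] = self.tick
            return True
        while self.used + size > self.capacity and self.meta:
            victim = min(self.meta, key=self._weight)  # lowest weight goes
            self.used -= self.meta.pop(victim)[0]
        if size <= self.capacity:
            self.meta[rdd_id] = [size, 1, self.tick]
            self.used += size
        return False
```

On an access trace where one RDD is reused repeatedly while others appear once, the weighted policy keeps the frequently reused RDD that LRU would evict; for example, replaying `A, A, B, C, A` (each RDD of size 5, capacity 10) yields one hit under LRU but two under the weighted policy.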