In recent years, in-memory computing has become a new direction in the field of parallel systems research. As a representative in-memory computing framework, Spark improves the performance of parallel computing systems by reducing disk I/O during the job execution cycle. Compared with traditional parallel computing frameworks, Spark improves performance several-fold, but a large gap remains before it can meet the requirements of real-time big-data applications, so exploring optimizations for in-memory computing frameworks is an important issue currently facing the information industry. In this paper, we present a series of studies aimed at improving the job execution performance of the in-memory computing framework and the resource utilization of the cluster. The main work of this paper includes:

(1) We summarize the current state of research on in-memory computing technology and the Spark computation model. First, we review the development of in-memory computing technology in three categories: in-memory data management, in-memory computing frameworks, and typical performance optimization methods. Second, we divide the performance optimization methods for in-memory computing frameworks into three classes: methods based on resource configuration, methods based on job scheduling, and methods based on failure recovery, and we make a comprehensive comparison among them. Finally, we summarize existing research on in-memory computing models.

(2) We propose a self-adaptive cache management strategy for Spark. Spark lacks a good strategy for selecting valuable RDDs to cache in limited memory, and LRU ignores the computation cost of RDDs. To address these problems, we propose a self-adaptive cache management strategy (SACM), which comprises an automatic selection algorithm (Selection), a parallel cache clean-up algorithm (PCC), and a lowest-weight replacement algorithm (LWR). Selection identifies valuable RDDs and caches their partitions to speed up data-intensive computations. PCC cleans up valueless RDDs asynchronously to improve memory utilization. LWR takes comprehensive account of an RDD's usage frequency, computation cost, and size.

(3) We propose a partial-data-shuffled-first strategy for the in-memory computing framework. Focusing on the delay that wide-dependency stages impose on job execution, we design a partial data shuffled first algorithm (PDSF) that incorporates several novel techniques, such as priority scheduling of efficient executors, a strategy for minimizing executor wait time, and moderately skewed task allocation. PDSF breaks through the restrictions of the parallel computing model: it releases the high performance of efficient executors to shorten synchronous operations, and it establishes an adaptive task scheduling scheme to improve the efficiency of job execution. We further analyze the relevant properties of the algorithm and prove that PDSF conforms to a Pareto optimum.

(4) We propose parallelism deduction algorithms for the in-memory computing framework. An inappropriate parallelism parameter may degrade the performance of an in-memory computing framework. To address this issue, we propose parallelism deduction algorithms (PDA). First, based on an analysis of the relationship between the parallelism parameter and job execution efficiency, we define the parallelism deduction algorithm. Second, we compute the best parallelism for job execution from the size of the input data, the workers' computing resources, and the additional overhead of spilling and scheduling, and we prove that PDA achieves optimal state synchronization during job execution and improves resource utilization. Finally, we optimize the task scheduling of each stage with PDA, accelerating job execution and improving computational efficiency.

(5) We study a progressive filling partitioning and mapping algorithm for Spark based on allocation fitness degree. Spark's workload distribution method depends only on the number of keys and is unrelated to the actual amount of data. As a result, the distributed workload can be seriously inconsistent with the workers' computing abilities, which increases job latency; this phenomenon is even worse in heterogeneous cluster environments. Accounting for these issues, we first analyze the job execution mechanism and establish a task efficiency model and a shuffle model, then define the Allocation Fitness Degree (AFD) and formulate the optimization goal. On the basis of these definitions, this paper proposes the Progressive Filling Partitioning and Mapping algorithm (PFPM). PFPM establishes a data distribution scheme adapted to the reducers' computing abilities, decreasing synchronization latency during the shuffle process and increasing the cluster's computing efficiency.

(6) We propose a regression-checking algorithm for duplicate data detection in memory. In an in-memory computing cluster, redundant data occupies a large amount of memory and decreases the cluster's computing performance. To solve this problem, we propose a sliding blocking algorithm with regression checking for duplicate data detection (SBRC), which divides the data into a number of parts, compares corresponding parts between versions via hashing, and identifies redundant data. For segments that fail to match, the algorithm continues to detect duplicates inside the unmatched blocks, thereby improving detection precision.
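The LWR eviction policy described above weighs an RDD's usage frequency, computation cost, and size rather than recency alone. A minimal sketch of such a weighting, assuming a simple linear score (the `CachedRDD` record and the exact combination of the three factors are illustrative assumptions, not the thesis's formula):

```python
from dataclasses import dataclass

@dataclass
class CachedRDD:
    name: str
    freq: int    # how often the RDD is reused by later stages
    cost: float  # estimated recomputation cost (e.g. seconds)
    size: float  # cached size (e.g. MB)

def lwr_weight(rdd: CachedRDD) -> float:
    # Frequently reused, expensive-to-recompute RDDs are valuable to keep;
    # larger RDDs free more memory when evicted, lowering their weight.
    return rdd.freq * rdd.cost / rdd.size

def pick_victim(cache: list[CachedRDD]) -> CachedRDD:
    # Evict the lowest-weight entry, not merely the least recently
    # used one (the weakness of plain LRU noted above).
    return min(cache, key=lwr_weight)
```

Under this scoring, a rarely used, cheap-to-recompute RDD is evicted before a hot, expensive one even if the latter was touched less recently.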
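PDSF's "efficient executors first" idea can be sketched as a greedy dispatcher that hands the largest shuffle workloads to whichever executor frees up earliest, so fast executors begin shuffling partial data instead of idling until the stage barrier. The speed scores and the longest-task-first rule are illustrative assumptions, not the thesis's exact scheduler:

```python
import heapq

def pdsf_dispatch(tasks: list[float], executor_speeds: list[float]) -> list[int]:
    # tasks: estimated shuffle workload per task; executor_speeds:
    # relative throughput of each executor (higher = faster).
    heap = [(0.0, i) for i in range(len(executor_speeds))]  # (finish time, executor)
    heapq.heapify(heap)
    placement = [0] * len(tasks)
    # Schedule the largest tasks first (LPT-style), always onto the
    # executor whose queue drains earliest.
    for t_idx in sorted(range(len(tasks)), key=lambda i: -tasks[i]):
        finish, ex = heapq.heappop(heap)
        placement[t_idx] = ex
        heapq.heappush(heap, (finish + tasks[t_idx] / executor_speeds[ex], ex))
    return placement
```

The moderately skewed allocation mentioned above corresponds to fast executors accumulating more (and heavier) tasks than slow ones in this loop.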
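The PDA derivation above computes a best parallelism from the input size, the workers' resources, and spill overhead. One plausible shape of that calculation is sketched below; the parameter names and the fit-in-memory-then-round-to-full-waves rule are assumptions for illustration, not the thesis's exact formula:

```python
import math

def deduce_parallelism(input_mb: float, total_cores: int,
                       mem_per_task_mb: float) -> int:
    # Enough partitions that each task's split fits its memory budget,
    # avoiding spill to disk during execution.
    min_tasks = math.ceil(input_mb / mem_per_task_mb)
    # Round up to whole waves of the core count so every scheduling
    # wave keeps all executors busy (better state synchronization).
    waves = max(1, math.ceil(min_tasks / total_cores))
    return waves * total_cores
```

For example, 1000 MB of input with 8 cores and a 100 MB per-task budget needs at least 10 tasks, which rounds up to 2 full waves of 8, i.e. parallelism 16.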
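PFPM's progressive filling can be sketched as a greedy assignment of key clusters to the reducer whose load is currently smallest relative to its capacity, so faster reducers receive proportionally more data than Spark's key-count-only hash partitioning would give them. The capacity scores and largest-cluster-first order are illustrative assumptions:

```python
def pfpm_assign(cluster_sizes: dict[str, float],
                capacities: list[float]) -> list[list[str]]:
    # cluster_sizes: actual data volume per key cluster;
    # capacities: relative computing ability of each reducer.
    loads = [0.0] * len(capacities)
    assignment: list[list[str]] = [[] for _ in capacities]
    # Place the largest clusters first; each goes to the reducer with
    # the lowest resulting load-to-capacity ratio, progressively
    # filling all reducers toward proportional finish times.
    for key, size in sorted(cluster_sizes.items(), key=lambda kv: -kv[1]):
        r = min(range(len(capacities)),
                key=lambda i: (loads[i] + size) / capacities[i])
        loads[r] += size
        assignment[r].append(key)
    return assignment
```

With capacities 2:1, a reducer twice as fast ends up with roughly twice the data, which is the synchronization-latency reduction the abstract describes.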
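SBRC's two stages (fixed-block hash comparison between versions, then a regression check inside unmatched blocks) can be sketched as below. The block size, MD5 fingerprint, and byte-window regression step are assumptions standing in for the thesis's actual sliding-blocking details:

```python
import hashlib

def block_hashes(data: bytes, block: int) -> list[str]:
    # Fingerprint each fixed-size block of the data.
    return [hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

def detect_duplicates(old: bytes, new: bytes, block: int = 4) -> list[int]:
    # Stage 1: compare corresponding blocks of the two versions by
    # hash; positional matches are duplicate regions.
    old_h, new_h = block_hashes(old, block), block_hashes(new, block)
    dup_offsets, unmatched = [], []
    for i, h in enumerate(new_h):
        if i < len(old_h) and h == old_h[i]:
            dup_offsets.append(i * block)
        else:
            unmatched.append(i)
    # Stage 2 (regression checking): for match-failed segments, slide
    # a block-sized window over the old version to catch duplicates
    # that merely shifted position, improving detection precision.
    windows = {old[j:j + block] for j in range(0, max(len(old) - block + 1, 0))}
    for i in unmatched:
        if new[i * block:(i + 1) * block] in windows:
            dup_offsets.append(i * block)
    return sorted(dup_offsets)
```

In the test below, inserting one byte shifts every block, so stage 1 finds nothing, but the regression pass still recovers the duplicated block at offset 4.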