
Research On Cooperative Hybrid Parameter Update For Data Parallel Deep Learning Training Jobs

Posted on: 2024-01-16 | Degree: Master | Type: Thesis
Country: China | Candidate: D Xu | Full Text: PDF
GTID: 2568307064485424 | Subject: Distributed Deep Learning
Abstract/Summary:
Data parallelism is widely used for training on large datasets in distributed deep learning clusters, but at iteration boundaries it incurs expensive global parameter updates. Performance imbalance between training tasks, introduced by uneven workload division or biased resource allocation, causes the straggler problem, which can seriously degrade both training speed and model accuracy and poses new challenges for parameter updating in data-parallel deep learning jobs. Under load imbalance, the gradient delay caused by stragglers slows parameter updates and harms model accuracy. This thesis addresses that problem.

To improve the efficiency of parameter updating while preserving training accuracy, this thesis proposes a Cooperate Grouping Parallel (CGP) method built on data parallelism. CGP is a hybrid parameter update scheme that dynamically groups tasks according to their training performance, applying synchronous parameter updates within each group and asynchronous updates between groups. Intra-group synchronous updates ensure the correctness of parameter updates across parallel training tasks and preserve model accuracy, while asynchronous inter-group updates accelerate the update process and open the door to cooperation between groups. The key feature of the method is its flexible grouping mechanism: by dynamically adjusting the grouping during training, the parameter update burden can be rebalanced across task groups so that groups assist one another.

Based on this hybrid parameter update scheme, the thesis establishes a cooperative grouping parallel parameter update model for data-parallel deep learning jobs. The model quantifies the training time of a parallel job and characterizes model accuracy in terms of the differences between tasks. On top of this model, the thesis formulates the Cooperate Grouping Parallel Problem (CGPP): dynamically adjust the cooperative grouping scheme to achieve the best parallel training speed while guaranteeing training accuracy.

To solve this problem, the thesis further proposes the cooperate grouping parallel algorithm CGP, whose main feature is adaptability to the parallel scale of different distributed systems: CGP_AGA, based on an Adaptive Genetic Algorithm, targets small and medium parallel granularity, while CGP_PLS, based on Pareto Local Search, targets large-scale parallel granularity.
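The abstract gives no implementation details, so the following minimal Python sketch only illustrates the hybrid idea of synchronous averaging inside a group and asynchronous pushes between groups. The toy quadratic loss, the hard-coded grouping, the lock-based shared parameters, and the names run_group and local_gradient are assumptions for illustration, not the thesis's implementation.

```python
# Illustrative sketch only (not the thesis's code): intra-group synchronous
# gradient averaging with asynchronous inter-group pushes to shared parameters.
import threading
import numpy as np

DIM = 4
LR = 0.1
global_params = np.zeros(DIM)            # shared model parameters
param_lock = threading.Lock()            # guards asynchronous inter-group pushes

def local_gradient(params, rng):
    """Toy gradient of a quadratic loss plus worker noise (stand-in for SGD)."""
    return params - 1.0 + 0.01 * rng.standard_normal(DIM)

def run_group(worker_seeds, steps):
    """One task group: synchronous inside the group, asynchronous between groups."""
    rngs = [np.random.default_rng(s) for s in worker_seeds]
    for _ in range(steps):
        with param_lock:
            snapshot = global_params.copy()          # read current parameters
        # Intra-group synchronous step: every worker computes a gradient on the
        # same snapshot, then the group averages them (an all-reduce in spirit).
        grads = [local_gradient(snapshot, rng) for rng in rngs]
        group_grad = np.mean(grads, axis=0)
        # Inter-group asynchronous step: push the averaged gradient without
        # waiting for other (possibly straggling) groups.
        with param_lock:
            global_params[:] -= LR * group_grad

# Two groups of different sizes; in CGP the grouping would come from the
# search algorithms (CGP_AGA / CGP_PLS), here it is hard-coded.
groups = [[0, 1, 2], [3, 4]]
threads = [threading.Thread(target=run_group, args=(g, 50)) for g in groups]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final parameters:", np.round(global_params, 3))
```

In a real deployment the intra-group averaging would be an all-reduce across workers and the inter-group push would go through a parameter server or another shared store, with the grouping supplied by the search algorithms described below.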
The main contributions of this thesis are: (1) A Cooperate Grouping Parallel parameter update method, CGP, based on data parallelism, together with the novel Cooperate Grouping Parallel Problem (CGPP), whose goal is the optimal parallel training speed under a training accuracy guarantee. (2) A grouping configuration search algorithm for small and medium-scale distributed parallel environments: the cooperate grouping parallel parameter update algorithm CGP_AGA, based on an Adaptive Genetic Algorithm. (3) A grouping configuration search algorithm for large-scale distributed parallel environments: the cooperate grouping parallel parameter update algorithm CGP_PLS, based on Pareto Local Search. (4) A comprehensive evaluation demonstrating the effectiveness of CGP under both persistent and fluctuating imbalance; the approach mitigates the effects of imbalance without incurring additional adjustment costs.
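As an illustration of contributions (2) and (3), the sketch below shows a plain genetic-algorithm search over grouping configurations. It is not CGP_AGA or CGP_PLS: the per-worker speeds, the throughput estimate, and the staleness penalty in fitness() are assumed stand-ins for the thesis's training-time and accuracy models.

```python
# Illustrative sketch only (not CGP_AGA / CGP_PLS): a plain genetic-algorithm
# search over grouping configurations. A chromosome assigns each worker to a
# group; the fitness below is an assumed proxy for the thesis's models.
import random

WORKER_SPEED = [1.0, 0.95, 0.9, 0.5, 0.45, 1.05]    # assumed relative throughputs
NUM_GROUPS = 2
POP_SIZE, GENERATIONS, MUT_RATE = 20, 40, 0.2

def fitness(assign):
    groups = [[i for i, g in enumerate(assign) if g == k] for k in range(NUM_GROUPS)]
    if any(not g for g in groups):
        return float("-inf")                         # every group must be non-empty
    # A synchronous group advances at the pace of its slowest member (assumed),
    # and asynchronous groups overlap, so job throughput is the sum of paces.
    pace = [min(WORKER_SPEED[i] for i in g) * len(g) for g in groups]
    # Staleness proxy (assumed): groups progressing at very different paces
    # increase gradient delay, which hurts accuracy.
    return sum(pace) - (max(pace) - min(pace))

def crossover(a, b):
    cut = random.randrange(1, len(a))                # single-point crossover
    return a[:cut] + b[cut:]

def mutate(assign):
    return [random.randrange(NUM_GROUPS) if random.random() < MUT_RATE else g
            for g in assign]

population = [[random.randrange(NUM_GROUPS) for _ in WORKER_SPEED]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP_SIZE // 2]            # keep the fitter half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print("best grouping:", best, "fitness:", round(fitness(best), 3))
```

An adaptive genetic algorithm, as in CGP_AGA, would additionally adapt the crossover and mutation rates to the population's fitness, while a Pareto local search, as in CGP_PLS, would maintain a set of non-dominated speed/accuracy trade-offs rather than a single scalar fitness.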
Keywords: Parallel Deep Learning, Distributed System, Parameter Update, Straggler