Research On Multi-tenant Traffic Scheduling Policy For Distributed Deep Learning Training

Posted on: 2024-05-24

Degree: Master

Type: Thesis

Country: China

Candidate: W P Lu

Full Text: PDF

GTID: 2568306932455294

Subject: Cyberspace security

Abstract/Summary:

When the computing power of a single device is limited, distributed training, which spreads training across multiple computing nodes and accelerators, has become the primary solution for training large-scale models. Past efforts to accelerate distributed deep learning training have mainly focused on reducing communication overhead, but have paid little attention to how network bandwidth is used. Moreover, because of GPU fragmentation in clusters, more than one training job is frequently assigned to the same node to reduce waiting time. The resulting competition for network bandwidth among distributed training jobs cannot be overlooked, as it considerably reduces training speed.

First, we present a priority-based traffic scheduling strategy built on the cyclic features of distributed training task flows. The strategy takes into account the communication characteristics of forward computation and parameter updates during distributed deep learning training, allowing the communication of different training tasks in the cluster to be staggered in time. Based on this, we build a traffic scheduling system deployable on cluster nodes, which can monitor, classify, and predict the traffic states of distributed deep learning training. Compared to the default traffic scheduling strategy in Linux, our strategy can improve training speed by approximately 18%. When there is continuous background traffic, our strategy can increase training speed by approximately 22% by prioritizing bandwidth for distributed deep learning training.

We then present a reinforcement learning-based traffic scheduling strategy that further enhances the speed of distributed training. The proposed strategy is implemented efficiently on Field-Programmable Gate Arrays (FPGAs), taking into account federated learning training scenarios and the computational resources required to offload encryption and decryption to smart network cards. In the complex environment of a smart network card, bandwidth and computational resources must be coordinated. Therefore, this thesis constructs a reinforcement learning agent that automatically generates rational and fair traffic scheduling strategies in response to environmental changes. Experimental results demonstrate that the implemented smart network card achieves traffic isolation and priority-based packet scheduling, with high performance and low latency on the order of tens of nanoseconds. Furthermore, the reinforcement learning scheduling strategy validated on the smart network card improves training speed by nearly 14% compared to priority-based scheduling strategies.
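The effect of staggering communication between jobs that share a link can be illustrated with a toy simulation. The sketch below is not the thesis's system. It assumes a hypothetical model in which each job alternates between a fixed compute phase (no traffic) and a communication phase (a fixed amount of data), and in which a single shared link either splits its capacity equally among all sending jobs ("fair") or gives all of it to the highest-priority sender ("priority"). The names `Job`, `simulate`, and every parameter value are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """A toy periodic training job (hypothetical model, not from the thesis)."""
    compute_steps: int        # time steps of computation per iteration (no traffic)
    comm_units: int           # data units to transfer per iteration
    priority: int             # lower number is served first under "priority"
    compute_left: int = 0     # remaining compute steps in the current iteration
    comm_left: float = 0.0    # remaining data units in the current iteration
    iterations: int = 0       # completed iterations


def simulate(jobs, steps, policy, capacity=1.0):
    """Run `steps` time steps on one shared link and return iterations per job.

    policy="fair":     capacity is split equally among all sending jobs.
    policy="priority": the sending job with the lowest priority number
                       receives the whole capacity; the others wait.
    """
    if policy not in ("fair", "priority"):
        raise ValueError(f"unknown policy: {policy}")

    # Reset state so the same job list can be reused across runs.
    for job in jobs:
        job.compute_left = job.compute_steps
        job.comm_left = 0.0
        job.iterations = 0

    for _ in range(steps):
        # Jobs that have finished computing are waiting to send.
        senders = [i for i, job in enumerate(jobs) if job.compute_left == 0]
        alloc = {}
        if senders:
            if policy == "fair":
                for i in senders:
                    alloc[i] = capacity / len(senders)
            else:
                top = min(senders, key=lambda i: jobs[i].priority)
                alloc[top] = capacity

        for i, job in enumerate(jobs):
            if job.compute_left > 0:
                job.compute_left -= 1
                if job.compute_left == 0:
                    job.comm_left = float(job.comm_units)
            else:
                job.comm_left -= alloc.get(i, 0.0)
                if job.comm_left <= 1e-9:
                    # Communication done: one iteration finished, start computing.
                    job.iterations += 1
                    job.compute_left = job.compute_steps

    return [job.iterations for job in jobs]


if __name__ == "__main__":
    jobs = [Job(compute_steps=4, comm_units=2, priority=0),
            Job(compute_steps=4, comm_units=2, priority=1)]
    print("fair:    ", simulate(jobs, 120, "fair"))
    print("priority:", simulate(jobs, 120, "priority"))
```

In this toy setting, two identical jobs under equal sharing stay synchronized and always contend for the link, whereas strict priority delays the second job once and then leaves the two communication phases interleaved, so both jobs complete more iterations in the same time. The model ignores the traffic monitoring, classification, and prediction that the abstract describes, and its numbers are not comparable to the reported speedups.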

Keywords/Search Tags:

Distributed deep learning training, Federated learning training, Traffic scheduling policy, SmartNIC, Reinforcement learning

Related items

1	Research On Optimization Methods For Federated Learning Model Training
2	Design And Implementation Of Federated Learning Training Acceleration Scheme Based On Cloud-Edge Collaboration
3	Research On Fast Training Method Of Robotic Arm Based On Deep Reinforcement Learning
4	Research On Multi-agent System Decision Algorithm Based On Deep Reinforcement Learning
5	Deep Reinforcement Learning With Self-Generated Expert Samples
6	Research On Distributed Training For Imbalanced Data
7	Research And Implementation Of Agent Continuous Control Technology Based On Distributed Reinforcement Learning
8	Research On MEC Task Offloading And Resource Scheduling Based On Deep Reinforcement Learning
9	Design And Implementation Of A Decentralized Federated Learning And Data Sharing Platform
10	Research On Goal-Conditioned Hierarchical Multi-Agent Reinforcement Learning For Cooperative Environment