Research On Multi-tenant Traffic Scheduling Policy For Distributed Deep Learning Training

Posted on: 2024-05-24   Degree: Master   Type: Thesis
Country: China   Candidate: W P Lu   Full Text: PDF
GTID: 2568306932455294   Subject: Cyberspace security
Abstract/Summary:
In scenarios with limited computational capabilities, distributed training, which involves training across multiple computing nodes and accelerators, has become the primary solution for training large-scale models. Past efforts to accelerate distributed deep learning training have mainly focused on reducing communication overhead, but have paid little attention to network bandwidth usage. Moreover, because of GPU fragmentation in clusters, more than one training job is frequently assigned to the same node to reduce waiting time. The resulting competition for network bandwidth among distributed training jobs cannot be overlooked, as it considerably reduces training speed.

First, we present a priority traffic scheduling strategy based on the cyclic features of distributed training task flows. The strategy takes into account the communication characteristics of forward computation and parameter updates during distributed deep learning training, allowing the communication of different training tasks in the cluster to be staggered in time. Based on this, we build a traffic scheduling system, deployable on cluster nodes, that can monitor, classify, and predict the traffic of distributed deep learning training states. Compared to the default traffic scheduling strategy in Linux, our strategy improves training speed by approximately 18%. When there is continuous background traffic, our strategy can further increase training speed by approximately 22% by prioritizing bandwidth for distributed deep learning training.

We then present a reinforcement learning-based traffic scheduling strategy that further enhances the speed of distributed training. The proposed traffic scheduling strategy is implemented efficiently on Field-Programmable Gate Arrays (FPGAs), taking into account federated learning training scenarios and the computational resources required to offload encryption and decryption to smart network cards. In the complex environment of smart network cards, bandwidth and computational resources must be coordinated. Therefore, this paper constructs a reinforcement learning agent that automatically generates rational and fair traffic scheduling strategies in response to environmental changes. Experimental results demonstrate that the implemented smart network card achieves traffic isolation and priority-based packet scheduling, with low latency on the order of tens of nanoseconds and high performance. Furthermore, the reinforcement learning scheduling strategy, validated on the smart network card, improves training speed by nearly 14% compared to priority-based scheduling strategies.
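
The idea of staggering the communication of periodic training jobs can be illustrated with a small toy model. The sketch below is not the thesis's scheduler: the `Job` fields, the slot-based time model, and the `best_offset` search are all assumptions made for illustration. It treats each job as a fixed-length iteration that transmits during the first few slots, and searches for the start offset of a second job that minimizes the time both jobs transmit on a shared link.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical periodic training job (illustrative, not from the thesis)."""
    name: str
    period: int  # iteration length, in abstract time slots
    burst: int   # slots per iteration spent communicating


def busy_slots(job: Job, offset: int, horizon: int) -> set:
    """Slots in [0, horizon) during which the job is transmitting."""
    return {t for t in range(horizon) if (t - offset) % job.period < job.burst}


def overlap(job_a: Job, job_b: Job, offset_b: int, horizon: int) -> int:
    """Number of slots in which both jobs transmit at the same time."""
    return len(busy_slots(job_a, 0, horizon) & busy_slots(job_b, offset_b, horizon))


def best_offset(job_a: Job, job_b: Job, horizon: int) -> int:
    """Smallest start offset for job_b that minimizes contention with job_a."""
    return min(
        range(job_b.period),
        key=lambda o: (overlap(job_a, job_b, o, horizon), o),
    )


if __name__ == "__main__":
    a = Job("job-a", period=10, burst=4)
    b = Job("job-b", period=10, burst=4)
    horizon = 100
    o = best_offset(a, b, horizon)
    print("contended slots without staggering:", overlap(a, b, 0, horizon))
    print("chosen offset for job-b:", o)
    print("contended slots with staggering:", overlap(a, b, o, horizon))
```

In this toy setting, two jobs with identical periods can be shifted so that their communication bursts never coincide. Real training traffic is noisier, which is presumably why the system described above also monitors, classifies, and predicts traffic rather than relying on fixed offsets.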
Keywords/Search Tags: Distributed deep learning training, Federated learning training, Traffic scheduling policy, SmartNic, Reinforcement learning