
Interconnect Network Optimization For Distributed Deep Learning Training Systems

Posted on: 2023-12-21  Degree: Master  Type: Thesis
Country: China  Candidate: X Hou  Full Text: PDF
GTID: 2568307169978219  Subject: Engineering
Abstract/Summary:
In recent years, Artificial Intelligence (AI) has profoundly affected daily life in fields such as speech recognition and image classification, and Deep Learning is one of the principal ways to realize AI. As Deep Learning algorithms have advanced, Deep Neural Network (DNN) models have grown increasingly complex, while the growth of the Internet has made large-scale training datasets easy to obtain. The proliferation of model parameters and training data has improved the inference accuracy of DNN models and enhanced the capabilities of AI. However, training a complex DNN model requires computational and memory resources that can easily exceed the capacity of a single-accelerator system. Distributed training across multiple accelerators is one approach to this challenge.

During distributed training, activations, input gradients, and weight gradients must be communicated between single-accelerator systems. In a distributed training system built on a collective communication framework, this communication is carried out through collective communication operations such as All-to-All, All-Reduce, and All-Gather. The interconnection network topology and the collective communication algorithms therefore become key factors affecting training performance. In addition, the communication scheduling method of the distributed training system determines how much communication can be overlapped with computation, as well as how the communication traffic is distributed across different links, which also affects training speed.

This thesis is dedicated to optimizing the interconnection network of the distributed training system. By co-designing the interconnection network topology and the collective communication algorithm, and by optimizing the communication scheduling method, the communication efficiency of the distributed training system is improved, thereby improving the training speed of DNN models. The experimental results show that, compared with existing designs, the optimized interconnection network proposed in this thesis significantly improves DNN training speed. The main contributions of this thesis are as follows:

(1) Co-design of interconnection network topologies and collective communication algorithms. First, to address the low efficiency of cross-node communication in the ring topology used by current distributed training systems, this thesis implements a fully connected topology inside the package. Second, based on the fully connected topology, this thesis implements a collective communication algorithm with higher communication efficiency. Finally, combined with a Torus network, this thesis implements a new hierarchical interconnection network that improves the efficiency of collective communication operations and the speed of distributed training (an illustrative sketch of such a two-level All-Reduce follows this abstract).

(2) Optimization of the communication scheduling method of the distributed training system. First, this thesis reorganizes the training iteration to maximize the overlap of communication and computation, which improves the training speed of data parallelism (the second sketch below illustrates this kind of overlap). Second, this thesis optimizes the communication scheduling of the global All-Gather operation, improving the training speed of model parallelism by offloading more of the communication load onto the high-bandwidth interconnection links.
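The hierarchical network in contribution (1) pairs fully connected intra-package links with a Torus across packages. The thesis's own implementation is not given on this page, so the following is only a minimal illustrative sketch of how a two-level All-Reduce over such a network is commonly structured, written against torch.distributed. PACKAGE_SIZE, the group layout, and the function name are assumptions, and the gradient is assumed to divide evenly across the package.

```python
# Illustrative sketch (not the thesis implementation): hierarchical All-Reduce
# that keeps most traffic on the fast intra-package links.
import torch
import torch.distributed as dist

PACKAGE_SIZE = 4  # assumption: accelerators per package

def hierarchical_all_reduce(grad: torch.Tensor) -> torch.Tensor:
    rank = dist.get_rank()
    world = dist.get_world_size()
    pkg_id, local_id = divmod(rank, PACKAGE_SIZE)

    # Every rank must create every group (a torch.distributed requirement),
    # including groups it does not belong to.
    intra = [dist.new_group(list(range(p * PACKAGE_SIZE, (p + 1) * PACKAGE_SIZE)))
             for p in range(world // PACKAGE_SIZE)]
    inter = [dist.new_group(list(range(l, world, PACKAGE_SIZE)))
             for l in range(PACKAGE_SIZE)]

    # 1) Reduce-scatter inside the package over the fully connected links:
    #    each local rank ends up owning one reduced shard of the gradient.
    shards = list(grad.chunk(PACKAGE_SIZE))          # assumes even division
    my_shard = torch.zeros_like(shards[local_id])
    dist.reduce_scatter(my_shard, shards, group=intra[pkg_id])

    # 2) All-Reduce only the owned shard across packages (e.g. over the Torus),
    #    so the slower inter-package links carry 1/PACKAGE_SIZE of the data.
    dist.all_reduce(my_shard, group=inter[local_id])

    # 3) All-Gather inside the package to rebuild the full reduced gradient.
    gathered = [torch.zeros_like(my_shard) for _ in range(PACKAGE_SIZE)]
    dist.all_gather(gathered, my_shard, group=intra[pkg_id])
    return torch.cat(gathered).view_as(grad)
```

Because the reduce-scatter and All-Gather stay on the fully connected intra-package links, only one shard per local rank crosses the inter-package links, which is the kind of traffic split a co-designed topology and collective algorithm can exploit.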
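For contribution (2), the underlying technique is overlapping gradient communication with the backward pass. The sketch below illustrates that idea with per-parameter asynchronous All-Reduce hooks; it is not the thesis's scheduler, and attach_overlap_hooks and its bookkeeping are assumptions made for illustration.

```python
# Illustrative sketch (not the thesis scheduler): start an asynchronous
# All-Reduce for each parameter as soon as its gradient is produced, so
# communication proceeds while earlier layers' gradients are still computing.
import torch
import torch.distributed as dist

def attach_overlap_hooks(model: torch.nn.Module):
    in_flight = []  # (async work handle, gradient tensor) pairs

    def hook(grad):
        # Asynchronous collective: returns immediately with a work handle.
        work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
        in_flight.append((work, grad))
        return grad

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(hook)

    def synchronize():
        # Call once after loss.backward(), before optimizer.step(): wait for
        # outstanding collectives and turn the summed gradients into averages.
        world = dist.get_world_size()
        for work, grad in in_flight:
            work.wait()
            grad.div_(world)
        in_flight.clear()

    return synchronize
```

In this sketch, attach_overlap_hooks(model) is called once before training; each iteration then runs loss.backward(), calls the returned synchronize(), and only afterwards steps the optimizer.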
Keywords/Search Tags: Distributed training, Distributed training system, Collective communication operation, Collective communication algorithm, Communication scheduling