
System Support For Low-Rank Decomposition Gradient Compression Algorithms In Deep Learning Data Parallel Training

Posted on: 2024-08-10  Degree: Master  Type: Thesis
Country: China  Candidate: H Wu  Full Text: PDF
GTID: 2568306929990349  Subject: Computer system architecture
Abstract/Summary:
With the continuous development of deep learning, large models have gradually stood out among deep learning models thanks to their excellent sample efficiency and strong generalization ability, becoming the most competitive models in fields such as natural language processing. However, training time and memory consumption during large-model training have become major challenges for researchers. To accelerate large-model training, researchers have proposed many acceleration techniques, among which distributed training is the most widely used. Distributed training mainly consists of data parallelism, model parallelism, and pipeline parallelism; data parallelism is widely adopted because of its simplicity and relatively small overhead. However, gradient synchronization has gradually become a performance bottleneck in data parallelism for two reasons. First, as model sizes grow, the volume of data exchanged during gradient synchronization also increases significantly. Second, the mismatch between the growth of computing power and of network bandwidth means that inter-node synchronization can no longer keep up with the rate at which gradients are produced. To address this bottleneck, researchers have turned to gradient compression algorithms to reduce the communication volume of gradient synchronization. Traditional gradient compression algorithms include quantization, sparsification, and hybrid schemes. In recent work, however, low-rank decomposition algorithms have become a research focus because their linear additivity allows the use of efficient ring communication. Despite this, the community still lacks high-performance system support for low-rank decomposition algorithms, which makes it difficult for researchers to study them and for engineers to apply them in industry.

To promote the development and application of low-rank decomposition algorithms and meet the needs of researchers and engineers, this thesis provides system support for low-rank decomposition algorithms in deep learning data-parallel training. The contributions are as follows.

First, this thesis provides efficient low-rank decomposition operators and abstract interfaces for low-rank decomposition algorithms. The operators cover the low-rank decomposition algorithms common in recent research, and with them users can easily implement high-performance gradient compression algorithms. Through the abstract interfaces, users can integrate gradient compression algorithms into mainstream deep learning frameworks at low cost. Taking PowerSGD as an example, we describe the implementation of the algorithm in detail and explain the underlying operators it depends on.
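To make the PowerSGD example concrete, the following is a minimal sketch of one PowerSGD-style compression round for a single 2-D gradient, written against plain PyTorch rather than the thesis's custom operators. The function name, the warm-started basis `q`, and the use of `torch.linalg.qr` for orthogonalization are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.distributed as dist

def powersgd_round(grad, q):
    """One PowerSGD-style compression round for a 2-D gradient `grad` (n x m).

    `q` (m x r) is the low-rank basis carried over from the previous iteration
    (power-iteration warm start). Returns the rank-r approximation of the
    globally averaged gradient and the updated basis.
    """
    world_size = dist.get_world_size()

    # P = M Q. The factor is linear in the local gradient, so summing the
    # workers' factors equals compressing the summed gradient -- the property
    # that makes ring all-reduce applicable.
    p = grad @ q
    dist.all_reduce(p)          # exchanges an n x r factor instead of n x m
    p /= world_size

    # Orthogonalize the averaged P before the second matmul (power iteration).
    p, _ = torch.linalg.qr(p)

    # Q = M^T P, again linear in the local gradient.
    q_new = grad.t() @ p
    dist.all_reduce(q_new)
    q_new /= world_size

    # Decompress: rank-r approximation of the averaged gradient.
    return p @ q_new.t(), q_new
```

In the full system described above, such a round would presumably be applied per gradient bucket, with the two all-reduce calls overlapped with back-propagation by the scheduling module discussed next.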
Second, this thesis provides system optimizations for low-rank decomposition algorithms. On the one hand, it provides a fine-grained task scheduling module for PyTorch. By maintaining multiple queues, the module ensures that computation is executed correctly and efficiently, and it hides gradient communication overhead behind DNN computation and gradient compression. On the other hand, this thesis examines the choice of gradient communication pattern for low-rank decomposition algorithms: exploiting their linear additivity, it analyzes the benefits and additional overheads of the two commonly used gradient communication patterns in the low-rank setting and ultimately selects ring communication (a minimal sketch of this property appears after the results summary below).

After integrating the above system support, this thesis conducts end-to-end training experiments on a 16-node cluster with 128 NVIDIA V100 GPUs connected by a 100 Gbps network. Experimental results show that, compared with the open-source low-rank decomposition baseline (the PowerSGD algorithm implemented on top of TorchDDP), our framework achieves a performance improvement of 22.7% to 39.3%. We also evaluate the low-rank decomposition operators and the task scheduling module separately to analyze their individual contributions to the performance gains. Finally, experiments on GPU utilization and model convergence show that gradient compression improves hardware utilization without negatively affecting model convergence.
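As referenced above, the sketch below illustrates the linear-additivity property that motivates the ring communication pattern. It is a self-contained numerical check; the shapes, worker count, and tensor names are illustrative assumptions, not values from the experiments.

```python
import torch

# Two workers' local gradients and a shared low-rank basis (illustrative shapes).
n, m, r = 1024, 512, 4
g1, g2 = torch.randn(n, m), torch.randn(n, m)
q = torch.randn(m, r)

# Summing the compressed factors ...
sum_of_factors = g1 @ q + g2 @ q
# ... equals compressing the summed gradient. A ring all-reduce sums partial
# contributions segment by segment, so it can operate directly on the small
# n x r factors instead of the full n x m gradients.
factor_of_sum = (g1 + g2) @ q

print(torch.allclose(sum_of_factors, factor_of_sum, atol=1e-3))  # expected: True
```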
Keywords/Search Tags: Deep Learning, Distributed Training, Data Parallelism, Gradient Compression, Low-Rank Decomposition, System Integration