Research On Acceleration Method Of Deep Convolutional Neural Networks Based On Hybrid Parallelism

Posted on: 2023-04-07    Degree: Master    Type: Thesis
Country: China    Candidate: R D Liu    Full Text: PDF
GTID: 2568306776978169    Subject: Engineering
Abstract/Summary:
In recent years, Deep Convolutional Neural Networks (DCNNs) have made great progress in computer vision tasks such as image classification, semantic segmentation, and object detection. However, to improve model accuracy and generalization, the scale of DCNNs keeps growing, so training requires large amounts of time and computational resources. The mainstream distributed training approaches are Data Parallelism (DP) and Model Parallelism (MP). Used alone, each is limited by inherent model dependencies and GPU memory and incurs high communication overhead, so the achievable parallel gain still needs to be explored and improved.

To address these problems, a grouped pipeline hybrid parallel training method (GroPipe) is proposed. It breaks through the memory limit of a single GPU by combining the advantages of data parallelism and model parallelism, and it accelerates the training of large models by exploiting dependencies, dynamically mining the parallelism available at runtime, and balancing communication overhead by overlapping computation with communication. The specific work is as follows:

(1) An intra-group pipelined model parallelism method is proposed. To address the problem that large models cannot be trained within the memory of a single GPU, this work extends training from a single GPU to multiple GPUs and uses pipelined model parallelism for efficient training. First, an automatic model partitioning algorithm divides a large model into multiple partitions by layers, and each partition is placed on its corresponding GPU device. Then, each mini-batch is further divided into multiple micro-batches, which are fed into the partitions in sequence. Next, the dependencies between micro-batches and partitions in the pipeline are analyzed, and dependency operators are constructed and added to the model's computational graph. Finally, a pipelined model parallel scheduling algorithm executes the forward and backward tasks, making full use of each GPU's computing resources and improving GPU utilization.

(2) An inter-group data parallelism method is proposed. To address the limited parallelism of intra-group model training, data parallelism and an optimized communication strategy are introduced among groups to further increase parallelism and speed up training. First, traditional data parallelism is extended: each group aggregates multiple GPUs to hold the model and is trained independently under inter-group data parallelism. Then, a partition-based delayed communication strategy is proposed for synchronizing gradient tensors during backpropagation; it reduces gradient tensor fragmentation, improves bandwidth utilization, and overlaps computation with gradient synchronization. Finally, to further accelerate convergence, a decay strategy based on a hybrid cosine and linear learning rate is proposed.

GroPipe is implemented on top of the PyTorch framework, and all experiments are conducted on a server with 8 GPUs. The results show that GroPipe effectively accelerates the training of neural networks without losing Top-1 accuracy. Compared with the DP and torchgpipe baselines, GroPipe improves the speedup by 59.6% and 14.3% on ResNet-50, and by 111.2% and 30.9% on VGG-16, respectively. In summary, the proposed GroPipe achieves effective performance improvement for large-model training and has broad application prospects and practical significance in both academia and industry.
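
To make the intra-group pipelined model parallelism in (1) concrete, the following is a minimal PyTorch sketch, not the thesis's implementation: GroPipe's automatic partitioning algorithm, dependency operators, and scheduler are not reproduced, and the function names (partition_by_layers, pipelined_forward) and the naive even split by layer count are illustrative assumptions only.

    # Hypothetical sketch of layer-wise partitioning and micro-batch pipelining.
    import torch
    import torch.nn as nn

    def partition_by_layers(model: nn.Sequential, devices):
        """Naively split a sequential model into len(devices) contiguous partitions."""
        layers = list(model.children())
        per_part = (len(layers) + len(devices) - 1) // len(devices)
        partitions = []
        for i, dev in enumerate(devices):
            part = nn.Sequential(*layers[i * per_part:(i + 1) * per_part]).to(dev)
            partitions.append(part)
        return partitions

    def pipelined_forward(partitions, devices, mini_batch, num_micro_batches=4):
        """Split the mini-batch into micro-batches and push them through the partitions.
        A real pipeline schedule overlaps micro-batches on different GPUs; here they
        are processed one after another for clarity."""
        outputs = []
        for micro in mini_batch.chunk(num_micro_batches):
            x = micro
            for part, dev in zip(partitions, devices):
                x = part(x.to(dev))
            outputs.append(x)
        return torch.cat(outputs)

    if __name__ == "__main__":
        devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]
        model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
        parts = partition_by_layers(model, devices)
        out = pipelined_forward(parts, devices, torch.randn(16, 32))
        print(out.shape)  # torch.Size([16, 10])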
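
The partition-based delayed communication strategy in (2) is described only at a high level in the abstract. The sketch below shows the general pattern of bucketing gradients and launching asynchronous all-reduce operations during backpropagation so communication overlaps with computation; the class name, the fixed bucket size, and the use of register_post_accumulate_grad_hook (PyTorch >= 2.1) are assumptions and stand in for the thesis's partition-based grouping.

    # Hypothetical sketch of delayed, bucketed gradient synchronization that
    # overlaps communication with backpropagation.
    # Assumes torch.distributed is already initialized, e.g. dist.init_process_group("nccl").
    import torch
    import torch.distributed as dist

    class BucketedGradSync:
        def __init__(self, model, bucket_cap_params=4):
            self.handles = []
            self.buckets, bucket = [], []
            # Walk parameters in reverse so buckets roughly follow backward order.
            for p in reversed(list(model.parameters())):
                if p.requires_grad:
                    bucket.append(p)
                    if len(bucket) == bucket_cap_params:
                        self.buckets.append(bucket)
                        bucket = []
            if bucket:
                self.buckets.append(bucket)
            # Hook the last parameter of each bucket: once its gradient has been
            # accumulated, launch an asynchronous all-reduce for the whole bucket.
            for bucket in self.buckets:
                bucket[-1].register_post_accumulate_grad_hook(self._make_hook(bucket))

        def _make_hook(self, bucket):
            def hook(_param):
                for p in bucket:
                    if p.grad is not None:
                        self.handles.append(dist.all_reduce(p.grad, async_op=True))
            return hook

        def wait(self):
            # Call before optimizer.step(): wait for outstanding all-reduces,
            # then average the summed gradients across ranks.
            for h in self.handles:
                h.wait()
            self.handles.clear()
            world = dist.get_world_size()
            for bucket in self.buckets:
                for p in bucket:
                    if p.grad is not None:
                        p.grad.div_(world)

PyTorch's DistributedDataParallel applies the same overlap idea with fixed-size gradient buckets; the strategy described in the abstract instead groups gradients by model partition, which it credits with reducing gradient tensor fragmentation and improving bandwidth utilization.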
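
The abstract mentions a hybrid cosine and linear learning rate decay but gives no formula. One plausible reading, sketched below purely as an assumption, is a cosine phase followed by a linear tail; the switch epoch and the intermediate floor value are arbitrary illustration choices, not values from the thesis.

    # Hypothetical hybrid cosine/linear decay schedule.
    import math
    from torch.optim.lr_scheduler import LambdaLR

    def hybrid_cosine_linear(optimizer, total_epochs, switch_epoch):
        def factor(epoch):
            floor = 0.5  # hypothetical fraction the cosine phase decays to
            if epoch < switch_epoch:
                cos = 0.5 * (1.0 + math.cos(math.pi * epoch / switch_epoch))  # 1 -> 0
                return floor + (1.0 - floor) * cos                            # 1 -> floor
            remaining = max(1, total_epochs - switch_epoch)
            return floor * max(0.0, 1.0 - (epoch - switch_epoch) / remaining)  # floor -> 0
        return LambdaLR(optimizer, lr_lambda=factor)

    # Usage (hypothetical values):
    #   opt = torch.optim.SGD(model.parameters(), lr=0.1)
    #   sched = hybrid_cosine_linear(opt, total_epochs=90, switch_epoch=60)
    #   call sched.step() once per epoch after the training loop body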
Keywords/Search Tags:Pipeline parallelism, Hybrid parallelism, Automatic partitioning, Parallel computing, Deep convolutional neural networks