Distributed deep neural network (DNN) training has been widely deployed in large-scale computing clusters, and the intensive communication and synchronization cost of gradients and parameters during training is becoming a performance bottleneck for distributed deep learning. Effectively measuring the communication process and understanding the overall training process are essential for discovering and optimizing such communication bottlenecks. However, many existing communication measurement tools and communication optimization methods still have serious limitations. In this paper, two studies are carried out in the field of distributed deep learning communication optimization: (1) optimizing communication measurement tools in the deep learning framework MXNet; (2) addressing communication bottlenecks in the popular communication framework Horovod when using the TCP protocol.

Firstly, this paper makes the first attempt to propose an open-source, fine-grained, user-friendly communication measurement tool, vSketchDLC, for the deep learning framework MXNet. Many existing communication measurement tools, such as the MXNet profiler, cannot satisfy three requirements simultaneously: fine-grained measurement of low-level communication operations, occupying as few computing resources as possible, and providing comprehensive measurement results that are convenient for user analysis. As a result, it is difficult for users to quickly find communication bottlenecks. vSketchDLC can track the low-level communication events between the deep learning framework and the communication library interface, and capture the end-to-end communication between devices. It generates communication records in a standard format, so users can analyze them with standard visualization tools such as Chrome Trace Viewer. In addition, vSketchDLC ensures that measurement does not affect training performance. We conduct extensive experiments to verify the effectiveness of vSketchDLC for MXNet. The experimental results show that vSketchDLC enables users to analyze communication records through friendly interaction, observe the relationships between different communications, and identify potential training bottlenecks from multiple perspectives, such as time, iteration, and DNN layer, to seek ways to improve training performance.
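To make the record format concrete, the sketch below shows what an end-to-end communication event could look like when written in the standard Chrome Trace Event (JSON) format consumed by Chrome Trace Viewer; the event name and the "args" fields are illustrative assumptions rather than vSketchDLC's actual schema, and only the container keys (traceEvents, name, ph, ts, dur, pid, tid, args) come from the Chrome trace format itself.

```python
import json

# A minimal sketch of one end-to-end communication record in Chrome Trace
# Event format. Field values below are hypothetical examples, not output
# produced by vSketchDLC.
trace = {
    "traceEvents": [
        {
            "name": "push:conv0_weight",   # hypothetical layer-level event name
            "ph": "X",                     # "complete" event: begin + duration
            "ts": 1052341,                 # start timestamp (microseconds)
            "dur": 830,                    # duration (microseconds)
            "pid": 0,                      # e.g. worker rank
            "tid": 3,                      # e.g. sender thread / channel id
            "args": {"bytes": 1048576, "iteration": 12},
        },
    ]
}

# Writing the records to a JSON file lets users open them directly in
# chrome://tracing (Chrome Trace Viewer) without any custom tooling.
with open("comm_trace.json", "w") as f:
    json.dump(trace, f)
```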
Secondly, this paper optimizes the communication bottleneck of the existing Horovod framework on the public cloud from the overall perspective of distributed training. Distributed deep learning training on public cloud GPU clusters is one of the most recommended practices. Existing public cloud GPU data centers, such as Amazon EC2 and Alibaba GPU clouds, are usually equipped with commodity high-speed Ethernet and TCP networks. However, on such public clouds, Horovod, one of the most popular distributed communication frameworks, struggles to achieve performance that matches the cluster configuration. This stems from Horovod's difficulty in using CPU resources to alleviate TCP protocol stack overhead. In this setting, merely optimizing the upper-layer algorithm or improving network performance cannot resolve the communication bottleneck. This paper makes the first attempt to improve the messaging interface of Horovod to solve the mismatch between computing and communication capabilities when deploying Horovod in a TCP-based public cloud GPU cluster. We propose FastHorovod, which uses additional low-cost auxiliary CPU communication processes to transmit messages in parallel to speed up communication, improve the utilization of high-cost GPU resources and network bandwidth resources, and achieve cost-effective distributed training. Experimental results show that FastHorovod can significantly accelerate communication in distributed training on public cloud clusters equipped with TCP, increasing the training speed of the AlexNet and VGG16 models by 64.5% and 72.6%, respectively.
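As an illustration of the general idea behind FastHorovod, and not its actual implementation, the sketch below splits one large gradient message across several auxiliary CPU sender processes so that no single TCP stream, and the CPU core driving it, becomes the bottleneck; the transport call itself is only stubbed out, and the function names are hypothetical.

```python
import multiprocessing as mp
import numpy as np

def send_chunk(args):
    """Hypothetical auxiliary sender: a real system would push its chunk
    through its own TCP connection here; this stub only reports the size."""
    chunk_id, chunk = args
    # socket.sendall(chunk.tobytes()) would go here in a real transport
    return chunk_id, chunk.nbytes

def parallel_send(tensor, num_senders=4):
    """Split one large message across several CPU sender processes so the
    chunks can move through multiple TCP streams in parallel."""
    chunks = np.array_split(tensor, num_senders)
    with mp.Pool(processes=num_senders) as pool:
        return pool.map(send_chunk, enumerate(chunks))

if __name__ == "__main__":
    grad = np.random.rand(1 << 22).astype(np.float32)  # ~16 MB of gradients
    print(parallel_send(grad))
```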