| In recent years, deep neural networks have shown remarkable modeling capability in fields such as image and speech processing, and have become widely adopted in both academia and industry. Both communities need to train models quickly so that experimental results can be analyzed and algorithms adjusted without delay. This thesis therefore first defines the strategy and workflow for data-parallel training of deep neural networks in a multi-server, multi-GPU setting. Building on the hook mechanism, it then implements a simple, easy-to-use extension interface for data-parallel distributed training under the PyTorch framework. Analysis shows that in data-parallel training the fragmentation of gradient data is detrimental to all-reduce communication, so an asynchronous communication strategy is proposed that aggregates the fragmented gradients before performing all-reduce synchronization, which effectively improves communication throughput. The PyTorch extension also supports mixed-precision training and achieves a training speedup of up to 1.71× on GPUs with Tensor Cores. To address gradient numerical overflow in mixed-precision training, an adaptive overflow-aware loss scaling strategy is proposed, which effectively alleviates the non-convergence caused by gradient overflow. This work also shows that data-parallel training across multiple machines and multiple GPUs should use local batch normalization, whose acceleration benefit is especially pronounced for networks with many batch normalization layers. Finally, on 32 GPUs, a maximum efficiency of 99.4% is achieved, and MobileNet-v1 training completes in 1 hour and 37 minutes. In addition, with the distributed training extension strategy proposed in this thesis, models trained with large batches show no loss of accuracy and even exceed the official baselines: ResNet-50 reaches 78.06%, higher than the official 76.86%, and MobileNet-v1 reaches 73.48%, higher than the official 70.9%. |
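
The gradient aggregation idea described in the abstract can be illustrated with a minimal, single-process sketch. It is not the thesis's implementation: the function names (`build_buckets`, `all_reduce_mean`, `bucketed_all_reduce`) and the `bucket_cap` parameter are hypothetical, the all-reduce is simulated by an element-wise mean over in-memory lists, and the asynchronous overlap with the backward pass via hooks is omitted. The sketch only shows how packing many small gradients into flat buffers reduces the number of communication calls.

```python
from typing import Dict, List, Tuple


def build_buckets(grad_sizes: Dict[str, int], bucket_cap: int) -> List[List[str]]:
    """Greedily pack parameter names into buckets of at most bucket_cap elements.

    A gradient larger than bucket_cap gets a bucket of its own.
    """
    buckets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in grad_sizes.items():
        if current and used + size > bucket_cap:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets


def all_reduce_mean(buffers: List[List[float]]) -> List[float]:
    """Stand-in for a real all-reduce: element-wise mean over all workers."""
    n = len(buffers)
    return [sum(vals) / n for vals in zip(*buffers)]


def bucketed_all_reduce(
    worker_grads: List[Dict[str, List[float]]], bucket_cap: int
) -> Tuple[Dict[str, List[float]], int]:
    """Average gradients across workers using one simulated call per bucket.

    Returns the averaged gradients and the number of communication calls.
    """
    sizes = {name: len(g) for name, g in worker_grads[0].items()}
    buckets = build_buckets(sizes, bucket_cap)
    reduced: Dict[str, List[float]] = {}
    for bucket in buckets:
        # Flatten this bucket's gradients into one contiguous buffer per worker.
        flat = [[x for name in bucket for x in grads[name]] for grads in worker_grads]
        averaged = all_reduce_mean(flat)  # one call for the whole bucket
        # Split the averaged buffer back into per-parameter gradients.
        offset = 0
        for name in bucket:
            reduced[name] = averaged[offset:offset + sizes[name]]
            offset += sizes[name]
    return reduced, len(buckets)


if __name__ == "__main__":
    grads = [
        {"w1": [1.0, 2.0], "b1": [3.0], "w2": [4.0, 5.0, 6.0]},
        {"w1": [3.0, 4.0], "b1": [5.0], "w2": [6.0, 7.0, 8.0]},
    ]
    averaged, calls = bucketed_all_reduce(grads, bucket_cap=4)
    print(calls, averaged)
```

In this toy setup three gradient tensors are synchronized with two simulated calls instead of three. In a real multi-GPU job the saving matters more, because each all-reduce call carries a fixed latency cost that small tensors cannot amortize.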