Deep neural network (DNN) parallelization has become a central topic in high-performance computing and one of the most effective ways to address the slow training of large-scale models. However, as deep learning application scenarios continue to develop, network models containing ever more parameters are designed to solve more complex tasks, increasing the time consumed by model training. Traditional parallel methods, chiefly data parallelism and model parallelism, suffer from limitations such as high communication overhead and low parallel gain. To address these problems, this paper proposes a parallel acceleration method for deep neural networks based on data parallelism, with the following main contributions.

(1) A coarse-grained inter-layer parallelism method is proposed. To address the long training time and large communication overhead of traditional data parallelism and model parallelism, this paper combines data parallelism and pipeline parallelism to accelerate model training. First, we propose a hybrid parallel architecture for inter-layer pipeline parallelism that fully exploits the inherent parallelism in deep neural networks. Second, a model partitioning algorithm is proposed to find the optimal partitioning of a neural network model and maximize load balancing among multiple streams. Then, an inter-layer pipeline parallelism mechanism is proposed to maximize the overlap of computation and communication during training and to ensure that computational resources are used efficiently and concurrently. Finally, a multi-stream gradient synchronization strategy is proposed to further reduce the overhead of gradient synchronization during training.

(2) A fine-grained intra-layer acceleration optimization method is proposed. To address the long computation time and low memory utilization of deep convolutional layers, this paper proposes a deep convolutional layer acceleration optimization method that makes full use of computational resources and thus shortens intra-layer computation time. First, we analyze the parallelism of deep convolutional layers and propose a two-level partitioning strategy based on TVM, which reduces shared-memory conflicts while accelerating computation. Second, a first-level partitioning method and a second-level partitioning method are proposed for the deep convolutional layer, together with an optimal parameter configuration algorithm for the second-level partitioning that searches for the parameter combination with minimum computational overhead. Finally, the intra-layer computation is further accelerated by combining the two-level partitioning strategy with virtual threads.

The proposed method is evaluated on a multi-GPU server with three standard datasets (CIFAR-10, CIFAR-100, and Caltech 101) on ResNet-18, ResNet-34, and ResNet-50, respectively. The experimental results show that the proposed multi-stream inter-layer parallelism method accelerates the training of all three models relative to the traditional training method while preserving model convergence: ResNet-18, ResNet-34, and ResNet-50 achieve speedups of 1.47×, 1.39×, and 2.44×, respectively. The optimization method proposed in this paper effectively improves the speedup achieved during deep neural network training and has practical significance for the field of deep learning parallelization.
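The model partitioning idea in contribution (1), splitting a network's layers into contiguous pipeline stages so that per-stage load is balanced, can be illustrated with a minimal sketch. The greedy average-load heuristic below, the function name `partition_layers`, and its parameters are illustrative assumptions for exposition, not the paper's actual algorithm.

```python
def partition_layers(layer_costs, num_stages):
    """Split per-layer compute costs into contiguous pipeline stages.

    Illustrative greedy heuristic (not the paper's algorithm): close a
    stage once it reaches the average target load, while guaranteeing
    every remaining stage still receives at least one layer.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    total = sum(layer_costs)
    target = total / num_stages          # ideal load per stage
    stages, start, acc = [], 0, 0.0
    for i, cost in enumerate(layer_costs):
        acc += cost
        layers_left = len(layer_costs) - (i + 1)
        stages_left = num_stages - len(stages) - 1
        # Close this stage once it meets the average load, or when only
        # one layer per remaining stage is left to assign.
        if stages_left > 0 and (acc >= target or layers_left == stages_left):
            stages.append((start, i + 1))
            start, acc = i + 1, 0.0
    stages.append((start, len(layer_costs)))
    return stages
```

For example, `partition_layers([1, 1, 2, 2, 3, 3], 3)` yields stages with loads 4, 5, and 3, close to the ideal of 4 per stage. A real partitioner would also weigh inter-stage communication volume, which this sketch ignores.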
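The two-level partitioning strategy in contribution (2) follows the TVM-style pattern of tiling a convolutional layer's output: a first level assigns coarse tiles to parallel workers (e.g. GPU thread blocks), and a second level subdivides each into finer sub-tiles (e.g. per-thread work). The sketch below only enumerates such a tiling in pure Python to show the index decomposition; the function name `two_level_tiles` and the tile-size parameters are assumptions, and it does not reproduce the paper's parameter-search algorithm.

```python
def two_level_tiles(height, width, block_tile, thread_tile):
    """Enumerate a two-level tiling of an output plane.

    Level 1 walks block_tile x block_tile regions (coarse partitioning);
    level 2 subdivides each region into thread_tile x thread_tile
    sub-tiles (fine partitioning). Yields (row, col, h, w) for each
    sub-tile, clipped at the plane's edges so the tiling is exact.
    """
    for by in range(0, height, block_tile):
        for bx in range(0, width, block_tile):
            # Second-level split of the current first-level tile.
            for ty in range(by, min(by + block_tile, height), thread_tile):
                for tx in range(bx, min(bx + block_tile, width), thread_tile):
                    yield (ty, tx,
                           min(thread_tile, height - ty),
                           min(thread_tile, width - tx))
```

Every output element is covered by exactly one sub-tile, so the sub-tile areas always sum to `height * width`. In an actual TVM schedule the two levels would be bound to block and thread axes, and the tile sizes are exactly the parameters the proposed configuration algorithm would search over.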