
Runtime Optimization for Large-Scale Neural-Network Data-Parallel Training

Posted on: 2020-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: C F Jia
Full Text: PDF
GTID: 2428330578983122
Subject: Computer system architecture
Abstract/Summary:
Artificial intelligence (AI) technology has matured after years of research. Deep learning, which is built on artificial neural networks, has become a research hotspot in the AI field owing to its outstanding results. As deep learning has developed, neural network models have grown more complex, and the sample data used for training has grown rapidly as well. Because of the parametric nature of neural networks, the more data used for training and the more training iterations performed, the better the final result; however, this greatly increases the amount of computation and thus the training time. Targeting multi-node CPU+GPU computing platforms, this thesis optimizes the design of the distributed training runtime of a deep learning framework, improving the effectiveness of large-scale deep learning training and the utilization of cluster resources, and ultimately reducing training time. The main contents and results of this thesis are:

1. The open-source deep learning framework TensorFlow is migrated from TCP/IP to an RDMA implementation, improving data transmission bandwidth between nodes in a distributed environment. This thesis first ports the gRPC communication framework used by TensorFlow to RDMA; a second approach is then taken, in which TensorFlow's data transmission layer is directly replaced with an RDMA implementation. In the final tests, the optimized TensorFlow reaches the maximum bandwidth the hardware can support when transmitting large blocks of data. Building on the engineering experience gained during this work, the thesis also delivers a standalone RDMA communication framework, so that other applications with similar requirements can be ported and optimized quickly.

2. Several optimization schemes are designed and implemented for the computation and communication patterns of distributed data parallelism, so that distributed deep learning training completes efficiently. The thesis mainly uses a software-pipeline scheme to hide the communication latency of parameter synchronization, and further improves per-GPU training speed with a mixed-precision training scheme. Finally, it corrects the behavior of batch normalization in the distributed setting. With a series of adjustments to the optimization algorithm and hyperparameters, the proposed schemes are validated on the ImageNet dataset.

Some of the research reports and technical results of this work have been open-sourced and have attracted the attention of many developers in the open-source community. The conclusions of this thesis also provide a reference for the performance optimization of distributed deep learning training on domestic processors.
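Two of the ideas in the second contribution can be illustrated in code. The first is a minimal sketch (not the thesis implementation) of the software-pipeline scheme: each layer's gradient all-reduce is handed to a communication thread as soon as that gradient is produced, so communication overlaps with the backward computation of the layers below. The per-layer `backward` method and the `allreduce` callable are hypothetical stand-ins for the framework's real operators and the RDMA collective.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_backward(layers, loss_grad, allreduce):
    """Back-propagate layer by layer; hand each parameter gradient to a
    communication thread immediately, so the all-reduce overlaps with
    the backward pass of the layers below it."""
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending, upstream = [], loss_grad
        for layer in reversed(layers):
            # Hypothetical per-layer API: returns (parameter gradient,
            # gradient to propagate to the previous layer).
            param_grad, upstream = layer.backward(upstream)
            pending.append(comm.submit(allreduce, param_grad))
        # All gradients must be synchronized before the optimizer step.
        for fut in pending:
            fut.result()
```

The batch-normalization correction addresses the fact that, under data parallelism, each GPU sees only a small shard of the global batch, so locally computed statistics are noisy. A common fix, sketched below under the assumption of a sum all-reduce primitive (`allreduce_sum` is hypothetical), is to aggregate per-worker sums before computing the mean and variance, so the statistics reflect the global batch:

```python
import numpy as np

def sync_batch_norm_stats(x, allreduce_sum):
    """Compute batch-norm mean/variance over the global batch by
    all-reducing per-worker sums instead of using local statistics.
    x: local activations of shape (local_batch, features)."""
    n_global = allreduce_sum(np.array([x.shape[0]], dtype=np.float64))[0]
    mean = allreduce_sum(x.sum(axis=0)) / n_global
    var = allreduce_sum((x * x).sum(axis=0)) / n_global - mean ** 2
    return mean, var
```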
Keywords/Search Tags: Deep Learning, Distributed Training, Data Parallelism, RDMA