As deep neural networks achieve great success in image recognition, speech recognition, natural language processing, and other fields, more and more researchers and companies are engaging in deep learning research. However, a deep neural network generally has billions of parameters, so training a state-of-the-art model is very time-consuming. Over the last decade, GPUs have developed rapidly and now offer far higher computing performance than CPUs, and CUDA-based parallel acceleration techniques have greatly reduced training time. Even so, neural network computation remains expensive and needs further optimization on the GPU. In convolutional neural networks, convolution operations generally account for about 90% of the total network runtime, so tensor convolution has always been the focus of convolutional neural network optimization. After a thorough study of existing parallel convolution acceleration algorithms, this paper proposes an improved convolution acceleration scheme based on the idea of unrolling convolution into matrix multiplication (see the im2col sketch below).

The main work of this paper is as follows:

1) We analyze the network architecture and the operational characteristics of each layer, focusing on tensor convolution. Based on the properties of weight sharing and sparse connectivity, we describe the convolution procedure in detail. We also study the GPU hardware architecture, memory hierarchy, thread execution model, and CUDA, and describe the memory access characteristics and programming model in detail.

2) On the GPU platform, we adopt the idea of unrolling convolution into matrix multiplication and redesign the tensor convolution algorithm. We rotate the tensor so that it can be accessed in a coalesced manner, which improves memory access efficiency (the tiled-transpose sketch below illustrates the technique). Drawing on our understanding of the GPU hardware architecture, we manually optimize the implementation, including optimization at the SASS code level.

3) Through theoretical analysis, we calculate an upper bound on the performance of our implementation, which indicates the remaining optimization headroom (see the roofline-style sketch below). Finally, we compare the performance of our implementation with cuDNN. When the batch size is large and the kernel size varies from 2 × 2 to 7 × 7, our implementation outperforms cuDNN, especially for dilated convolution. This comparison shows that the convolution algorithm proposed in this paper has clear advantages.
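
For concreteness, the following is a minimal sketch of the unrolling (im2col) idea that the scheme builds on, shown for a single-channel valid convolution with dilation. The kernel name, the single-channel restriction, and all parameters are illustrative assumptions, not the thesis implementation.

// im2col sketch (illustrative, not the thesis code): expands an H x W
// single-channel input into a (K*K) x (outH*outW) column matrix so that
// convolution with a K x K kernel becomes one matrix multiplication.
// Dilation d generalizes the in-patch sampling step (d = 1 is ordinary
// convolution); outH = H - (K - 1) * d and outW = W - (K - 1) * d.
__global__ void im2col(const float* in, float* cols,
                       int H, int W, int K, int d,
                       int outH, int outW)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = outH * outW;            // one thread per output pixel
    if (idx >= total) return;
    int oy = idx / outW, ox = idx % outW;
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx) {
            int iy = oy + ky * d, ix = ox + kx * d;
            // Consecutive threads write consecutive elements of each
            // row of cols, so the global stores coalesce.
            cols[(ky * K + kx) * total + idx] = in[iy * W + ix];
        }
}

After this expansion, convolving reduces to multiplying the flattened 1 × (K*K) kernel by cols, which maps directly onto a highly tuned GEMM; dilated convolution falls out of the same code path by setting d > 1.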
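The "rotation" mentioned in 2) serves coalescing: threads of a warp should touch consecutive global addresses. The standard shared-memory tiled transpose below illustrates the general technique under the assumption of a row-major matrix; it is not the thesis kernel itself.

#define TILE 32

// Tiled transpose sketch (standard technique, assumed layout): each block
// loads a 32 x 32 tile with coalesced reads, then stores it transposed
// with coalesced writes. The +1 column of padding keeps both shared-memory
// access patterns free of bank conflicts.
__global__ void transpose(const float* in, float* out, int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;    // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}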
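The upper bound in 3) can be read as a roofline-style argument: attainable throughput is capped by the smaller of the device's peak compute rate and memory bandwidth times arithmetic intensity. The tiny host-side sketch below uses assumed example figures, not measurements or the thesis's actual analysis.

#include <algorithm>
#include <cstdio>

// Roofline-style bound sketch: a kernel can exceed neither the device's
// peak FP32 rate nor the rate at which DRAM feeds it, i.e.
// bandwidth * (FLOPs per byte moved). All numbers are assumed examples.
int main()
{
    double peak_flops = 11.3e12;  // assumed FP32 peak, FLOP/s
    double bandwidth  = 484e9;    // assumed DRAM bandwidth, bytes/s
    double intensity  = 25.0;     // assumed FLOPs per DRAM byte
    double bound = std::min(peak_flops, bandwidth * intensity);
    std::printf("performance upper bound: %.2f TFLOP/s\n", bound / 1e12);
    return 0;
}

Comparing measured throughput against such a bound shows how much optimization headroom remains in the implementation.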