As deep neural networks achieve great success in image recognition, speech recognition, natural language processing, and other fields, more and more researchers and companies are engaging in deep learning research. However, a deep neural network generally has billions of parameters, so training a state-of-the-art model is very time-consuming. Over the last decade, GPUs have developed rapidly and now offer far higher computing performance than CPUs, and CUDA-based parallel acceleration techniques have greatly reduced training time. Even so, neural network computation remains expensive and needs further optimization on the GPU. In convolutional neural networks, convolution operations generally account for about 90% of the total network runtime, so tensor convolution has always been the focus of convolutional neural network optimization. After a thorough study of existing parallel convolution acceleration algorithms, this paper proposes an improved convolution acceleration scheme based on the idea of unrolling convolution into matrix multiplication (see the im2col sketch below).

The main work of this paper is as follows:

1) We analyze the network architecture and the operational characteristics of each layer, focusing on tensor convolution. Based on the properties of weight sharing and sparse connectivity, we describe the convolution procedure in detail. We also study the GPU hardware architecture, memory hierarchy, thread execution model, and CUDA, and describe the memory access characteristics and programming model in detail.

2) On the GPU platform, we adopt the idea of unrolling convolution into matrix multiplication and redesign the tensor convolution algorithm. We rotate the tensor so that it can be accessed in a coalesced manner, which improves memory access efficiency (the tiled-transpose sketch below illustrates the technique). Drawing on our understanding of the GPU hardware architecture, we manually optimize the implementation, including optimization at the SASS code level.

3) Through theoretical analysis, we calculate an upper bound on the performance of our implementation, which indicates the remaining optimization headroom (see the roofline-style sketch below). Finally, we compare the performance of our implementation with cuDNN. When the batch size is large and the kernel size varies from 2 × 2 to 7 × 7, our implementation outperforms cuDNN, especially for dilated convolution. This comparison shows that the convolution algorithm proposed in this paper has clear advantages.
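
For concreteness, the following is a minimal sketch of the unrolling (im2col) idea that the scheme builds on, shown for a single-channel valid convolution with dilation. The kernel name, the single-channel restriction, and all parameters are illustrative assumptions, not the thesis implementation.

// im2col sketch (illustrative, not the thesis code): expands an H x W
// single-channel input into a (K*K) x (outH*outW) column matrix so that
// convolution with a K x K kernel becomes one matrix multiplication.
// Dilation d generalizes the in-patch sampling step (d = 1 is ordinary
// convolution); outH = H - (K - 1) * d and outW = W - (K - 1) * d.
__global__ void im2col(const float* in, float* cols,
                       int H, int W, int K, int d,
                       int outH, int outW)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = outH * outW;            // one thread per output pixel
    if (idx >= total) return;
    int oy = idx / outW, ox = idx % outW;
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx) {
            int iy = oy + ky * d, ix = ox + kx * d;
            // Consecutive threads write consecutive elements of each
            // row of cols, so the global stores coalesce.
            cols[(ky * K + kx) * total + idx] = in[iy * W + ix];
        }
}

After this expansion, convolving reduces to multiplying the flattened 1 × (K*K) kernel by cols, which maps directly onto a highly tuned GEMM; dilated convolution falls out of the same code path by setting d > 1.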
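The "rotation" mentioned in 2) serves coalescing: threads of a warp should touch consecutive global addresses. The standard shared-memory tiled transpose below illustrates the general technique under the assumption of a row-major matrix; it is not the thesis kernel itself.

#define TILE 32

// Tiled transpose sketch (standard technique, assumed layout): each block
// loads a 32 x 32 tile with coalesced reads, then stores it transposed
// with coalesced writes. The +1 column of padding keeps both shared-memory
// access patterns free of bank conflicts.
__global__ void transpose(const float* in, float* out, int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;    // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}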
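The upper bound in 3) can be read as a roofline-style argument: attainable throughput is capped by the smaller of the device's peak compute rate and memory bandwidth times arithmetic intensity. The tiny host-side sketch below uses assumed example figures, not measurements or the thesis's actual analysis.

#include <algorithm>
#include <cstdio>

// Roofline-style bound sketch: a kernel can exceed neither the device's
// peak FP32 rate nor the rate at which DRAM feeds it, i.e.
// bandwidth * (FLOPs per byte moved). All numbers are assumed examples.
int main()
{
    double peak_flops = 11.3e12;  // assumed FP32 peak, FLOP/s
    double bandwidth  = 484e9;    // assumed DRAM bandwidth, bytes/s
    double intensity  = 25.0;     // assumed FLOPs per DRAM byte
    double bound = std::min(peak_flops, bandwidth * intensity);
    std::printf("performance upper bound: %.2f TFLOP/s\n", bound / 1e12);
    return 0;
}

Comparing measured throughput against such a bound shows how much optimization headroom remains in the implementation.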