
Fast Sparse Deep Neural Network Inference On GPU

Posted on: 2023-09-14
Degree: Master
Type: Thesis
Country: China
Candidate: J Xin
Full Text: PDF
GTID: 2558307043974949
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of artificial intelligence, the parameter sizes of deep neural network models keep growing. Researchers therefore use techniques such as pruning to transform dense weight matrices into sparse ones, reducing both the storage cost and the computational overhead of the models. At the same time, with growing GPU computing power and advances in high-performance computing techniques, fast GPU-based inference systems for deep neural networks have matured.

The core operation of sparse deep neural network inference is sparse matrix-dense matrix multiplication (SpMM). SpMM performance is closely tied to data features such as the distribution of nonzero elements in the sparse matrices. As a result, different optimization methods achieve different effects on different data sets, and no single optimization method achieves optimal performance on all of them.

To address this problem, a GPU-based Fast Sparse DNN Inference system (FSDI) is proposed. FSDI takes data features such as the nonzero-element distribution of the sparse matrices as input, builds a model of the SpMM optimization space, and uses those features to search it for a well-suited method. Specifically, the SpMM optimization methods are first abstracted into a search space built from four loop transformations: loop tiling, loop parallelization, loop scheduling, and loop compaction. Second, a performance evaluation model that accounts for load balancing and memory access cost is proposed; combined with the features of the sparse matrix, it selects a suitable SpMM optimization method. Finally, the search is accelerated by pruning the optimization space according to the characteristics of the GPU architecture.

In addition, to deal with the large amount of intermediate data produced between consecutive operators, a sparsity-aware operator fusion mechanism is proposed. Through sparsity analysis, FSDI fuses multiple SpMM operations that have few computational dependencies and keeps the intermediate results in shared memory, improving performance by reducing global memory access overhead. To further increase the number of fusible operators, a locality-aware, hashing-based data rearrangement method reorders the nonzero elements, increasing the locality of the sparse matrix.

Performance tests are conducted on networks consisting of fully connected layers with 1024 to 65536 neurons. The results show performance improvements of 1.73x to 13.74x over the H&P system on a single V100 GPU.
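For reference, SpMM computes C = A * B for a sparse matrix A (here in CSR format) and a dense matrix B. The following minimal CUDA kernel is an illustrative sketch of this core operation, not code from the thesis; it uses one thread per output element and none of the loop-transformation optimizations FSDI searches over:

    // Minimal CSR SpMM: C[m][n] = sum_k A[m][k] * B[k][n], with A sparse in CSR.
    // One thread per output element; purely illustrative baseline.
    __global__ void spmm_csr_naive(int num_rows, int n_cols,
                                   const int* __restrict__ row_ptr,
                                   const int* __restrict__ col_idx,
                                   const float* __restrict__ vals,
                                   const float* __restrict__ B,   // dense, k x n, row-major
                                   float* __restrict__ C)         // dense, m x n, row-major
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows || col >= n_cols) return;

        float acc = 0.0f;
        for (int p = row_ptr[row]; p < row_ptr[row + 1]; ++p)
            acc += vals[p] * B[col_idx[p] * n_cols + col];
        C[row * n_cols + col] = acc;
    }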
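The search described above needs a performance model weighing load balance against memory access cost. The host-side sketch below scores candidate row-tile sizes from the CSR row pointers; the concrete scoring formula and weights are assumptions made for illustration, not the thesis's actual model:

    #include <vector>
    #include <algorithm>
    #include <cstdint>

    // Hypothetical cost model: score one candidate row-tile size using the two
    // terms the abstract names, load imbalance and memory access cost.
    double score_tile(const std::vector<int>& row_ptr, int tile_rows) {
        int num_rows  = (int)row_ptr.size() - 1;
        int num_tiles = (num_rows + tile_rows - 1) / tile_rows;

        int64_t total_nnz    = row_ptr[num_rows];
        int64_t max_tile_nnz = 0;
        for (int t = 0; t < num_tiles; ++t) {
            int lo = t * tile_rows;
            int hi = std::min(lo + tile_rows, num_rows);
            max_tile_nnz = std::max<int64_t>(max_tile_nnz, row_ptr[hi] - row_ptr[lo]);
        }
        // Load imbalance: heaviest tile relative to the average tile.
        double avg_tile_nnz = (double)total_nnz / num_tiles;
        double imbalance    = (double)max_tile_nnz / std::max(avg_tile_nnz, 1.0);

        // Memory cost proxy: one dense-row access per nonzero, amortized per tile.
        double mem_cost = (double)total_nnz / tile_rows;

        return imbalance + 0.001 * mem_cost;   // lower is better; weight is arbitrary
    }

    // The search keeps the candidate with the lowest predicted cost; the real
    // system prunes this space using GPU architecture characteristics.
    int pick_tile(const std::vector<int>& row_ptr) {
        int best = 32;
        double best_score = 1e300;
        for (int t : {16, 32, 64, 128, 256}) {
            double s = score_tile(row_ptr, t);
            if (s < best_score) { best_score = s; best = t; }
        }
        return best;
    }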
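The full operator-fusion mechanism depends on a sparsity and dependency analysis that the abstract only names. As a simplified stand-in, the sketch below fuses an SpMM with a ReLU epilogue, staging the intermediate row tile in shared memory so it never round-trips through global memory; in the real multi-SpMM case, a dependent fused operator would consume that staged tile:

    #define COL_TILE 128

    // Launch with grid((n_cols + COL_TILE - 1) / COL_TILE, num_rows), block(COL_TILE).
    __global__ void spmm_relu_fused(int num_rows, int n_cols,
                                    const int* __restrict__ row_ptr,
                                    const int* __restrict__ col_idx,
                                    const float* __restrict__ vals,
                                    const float* __restrict__ B,
                                    float* __restrict__ C)
    {
        __shared__ float tile[COL_TILE];    // intermediate result, kept on chip
        int row = blockIdx.y;               // one block row per sparse row
        int col = blockIdx.x * COL_TILE + threadIdx.x;
        bool active = (row < num_rows && col < n_cols);

        float acc = 0.0f;
        if (active)
            for (int p = row_ptr[row]; p < row_ptr[row + 1]; ++p)
                acc += vals[p] * B[col_idx[p] * n_cols + col];

        tile[threadIdx.x] = acc;            // stage intermediate in shared memory
        __syncthreads();                    // a fused consumer would read it here

        if (active)                         // fused epilogue, single global write
            C[row * n_cols + col] = fmaxf(tile[threadIdx.x], 0.0f);
    }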
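Finally, one plausible realization of the locality-aware, hashing-based rearrangement (the signature choice below is an assumption) is to sort rows by a hash of their column pattern, so that structurally similar rows become adjacent and dense rows of B are reused by neighboring sparse rows:

    #include <vector>
    #include <numeric>
    #include <algorithm>
    #include <cstdint>

    // Hypothetical locality-aware reordering: compute a 64-bit column-bucket
    // signature per row and sort rows by it, clustering rows that touch
    // similar column ranges. Returns the row permutation.
    std::vector<int> reorder_rows(const std::vector<int>& row_ptr,
                                  const std::vector<int>& col_idx) {
        int num_rows = (int)row_ptr.size() - 1;
        std::vector<uint64_t> sig(num_rows, 0);
        for (int r = 0; r < num_rows; ++r)
            for (int p = row_ptr[r]; p < row_ptr[r + 1]; ++p)
                sig[r] |= 1ull << (col_idx[p] % 64);   // hash columns into 64 buckets

        std::vector<int> perm(num_rows);
        std::iota(perm.begin(), perm.end(), 0);
        std::stable_sort(perm.begin(), perm.end(),
                         [&](int a, int b) { return sig[a] < sig[b]; });
        return perm;   // perm[i] = original index of the i-th row after reordering
    }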
Keywords/Search Tags: Parallel Computing, Graphics Processing Unit, Sparse Deep Neural Network, Sparse Matrix-dense Matrix Multiplication