
SSD Neural Network Acceleration Based On FPGA

Posted on: 2023-07-14    Degree: Master    Type: Thesis
Country: China    Candidate: X C Li    Full Text: PDF
GTID: 2568306908950819    Subject: Engineering
Abstract/Summary:
With the rapid development of deep learning, object detection algorithms based on convolutional neural networks have made breakthroughs and become one of the most popular research directions in recent years. They are widely used in autonomous driving, video surveillance, medical imaging, radar detection and other fields. Classical object detection algorithms based on convolutional neural networks include SSD, YOLOv3 and others. These algorithms deliver high detection performance, but they also bring huge numbers of parameters and large amounts of computation, so they are currently implemented mainly on GPUs and other high-compute platforms. GPUs, however, have disadvantages such as high power consumption and large size, which make them difficult to apply to mobile devices, and existing mobile devices often use general-purpose processors such as ARM, on which the computation-intensive SSD algorithm runs inefficiently and struggles to meet real-time requirements. To solve this problem, this paper proposes a set of FPGA-oriented optimizations for the SSD algorithm, designs an FPGA-based neural network accelerator, and finally deploys the optimized SSD algorithm on the accelerator. The main work of this paper is as follows:

(1) To improve the detection accuracy of the SSD algorithm without adding parameters or computation, this paper replaces part of the 3×3 convolution layers in the SSD backbone network VGG16 with RepVGG blocks, and at deployment time fuses the multi-branch structure of each RepVGG block into a single 3×3 convolution layer by structural reparameterization. Compared with the original SSD algorithm, the improved algorithm raises mAP on the dataset used in this paper by 0.78%. To distinguish it from the original SSD algorithm, this paper names the improved algorithm Rep-SSD.

(2) Building on work (1), in view of the large number of parameters and high computational complexity of the Rep-SSD algorithm, this paper uses sparsity-inducing training to identify redundant channels in Rep-SSD and prunes them with a model-clipping method. After clipping, the model size drops from 97.54 MB to 35.42 MB and the inference speed rises from 145 fps to 204 fps. The RepVGG blocks in Rep-SSD are then fused by structural reparameterization, reducing the model size from 35.42 MB to 19.63 MB and raising the inference speed from 204 fps to 250 fps. Finally, Rep-SSD is quantized with an 8-bit layer-wise symmetric quantization scheme, which reduces the model size from 19.63 MB to 5.12 MB.

(3) Building on work (2), to facilitate FPGA acceleration, this paper rearranges the dequantization layer, ReLU activation layer and quantization layer between adjacent convolutional layers during quantization. After the ReLU activation layer is moved after the quantization layer, the dequantization layer and the quantization layer become adjacent and are fused into a single requantization layer. This adjustment reduces the amount of computation spent on quantization and dequantization and improves the speed of the Rep-SSD algorithm.

(4) Building on work (3), this paper accelerates the Rep-SSD algorithm through software-hardware co-design: the convolutional and pooling layers of Rep-SSD run on the accelerator, while the detection head of Rep-SSD is computed on the CPU. The accelerator and the CPU communicate over an Ethernet port.

(5) In the accelerator design, an instruction is defined for each network layer of the Rep-SSD algorithm, containing the layer's structural information, data-storage information and quantization information. Different instructions can be configured to make the accelerator execute different network layers of the algorithm. This instruction design also makes the accelerator highly versatile: different algorithms can be deployed on it simply by changing the instructions. The core of the accelerator consists of three kinds of accelerating arrays: a 16×16 array of 3×3 convolution units, an 18×18 array of 1×1 convolution units, and a 1×9 array of pooling units, which accelerate 3×3 convolution layers, 1×1 convolution layers and pooling layers respectively. Through time-sharing multiplexing, these arrays can also serve larger network structures.

(6) To use the 3×3 convolution unit array more effectively, the 16×16 array is split into two 8×16 arrays, which the accelerator can flexibly combine according to the stride of the 3×3 convolution layer.
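The structural reparameterization in work (1) relies on the linearity of convolution: a 3×3 branch, a 1×1 branch and an identity branch can be summed into one equivalent 3×3 kernel (pad the 1×1 kernel to the centre of a 3×3 kernel; represent identity as a 3×3 kernel with 1 at its centre). A minimal single-channel sketch in plain Python; the helper names are illustrative, not from the thesis:

```python
def conv2d_3x3(x, k):
    """'Same' 3x3 convolution (cross-correlation, as in deep learning)
    with zero padding, on a 2-D list of floats."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(3):
                for dj in range(3):
                    ii, jj = i + di - 1, j + dj - 1
                    if 0 <= ii < h and 0 <= jj < w:
                        s += x[ii][jj] * k[di][dj]
            out[i][j] = s
    return out

def fuse_repvgg_kernels(k3, k1, with_identity=True):
    """Merge the 3x3, 1x1 and identity branches into one 3x3 kernel."""
    fused = [[k3[i][j] for j in range(3)] for i in range(3)]
    fused[1][1] += k1          # 1x1 kernel zero-padded to the centre
    if with_identity:
        fused[1][1] += 1.0     # identity path = centred Dirac kernel
    return fused
```

Running the fused kernel once then matches summing the three branch outputs, which is why the deployed model keeps the multi-branch accuracy at single-branch cost.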
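Works (2) and (3) use 8-bit layer-wise symmetric quantization and then fuse the dequantize/quantize pair between adjacent layers into one requantization step. A toy per-tensor sketch, assuming a max-abs calibration rule and illustrative scale values (neither is stated in the thesis):

```python
def calibrate_scale(xs):
    """Per-tensor scale for symmetric int8 quantization: max|x| / 127."""
    return max(abs(x) for x in xs) / 127.0

def _clamp_i8(v):
    return max(-128, min(127, v))

def sym_quantize(xs, scale):
    """Symmetric (zero-point-free) quantization of floats to int8."""
    return [_clamp_i8(round(x / scale)) for x in xs]

def sym_dequantize(qs, scale):
    return [q * scale for q in qs]

def requantize(qs, scale_in, scale_out):
    """Fused dequantize->quantize between adjacent layers: a single
    multiply by scale_in/scale_out replaces the float round trip."""
    m = scale_in / scale_out
    return [_clamp_i8(round(q * m)) for q in qs]
```

On hardware the ratio `scale_in / scale_out` is typically baked into each layer's instruction, so the accelerator never materialises float activations between layers.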
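The per-layer instruction of work (5) can be pictured as a packed word of bit fields. The field names and widths below are hypothetical, chosen only to show the encode/decode mechanics; the thesis does not publish its actual instruction format:

```python
FIELDS = [                  # (name, bit width), most significant first
    ("layer_type", 2),      # e.g. 0: 3x3 conv, 1: 1x1 conv, 2: pooling
    ("in_channels", 10),    # structural information
    ("out_channels", 10),
    ("feature_size", 10),
    ("src_addr", 16),       # data-storage information: feature-map bases
    ("dst_addr", 16),
    ("requant_shift", 6),   # quantization information for this layer
]

def encode(instr):
    """Pack a field dict into one instruction word (an int)."""
    word = 0
    for name, width in FIELDS:
        value = instr[name]
        assert 0 <= value < (1 << width), f"{name} out of range"
        word = (word << width) | value
    return word

def decode(word):
    """Unpack an instruction word back into a field dict."""
    instr = {}
    for name, width in reversed(FIELDS):
        instr[name] = word & ((1 << width) - 1)
        word >>= width
    return instr
```

Driving the accelerator then amounts to streaming one such word per network layer, which is what lets a different network be deployed by rewriting instructions instead of re-synthesising the FPGA design.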
Keywords/Search Tags: Accelerator, FPGA, SSD object detection algorithm, Neural network clipping