
Acceleration And Optimization Of Deep Learning Algorithm Based On Embedded GPU Platform

Posted on: 2020-06-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Yin
Full Text: PDF
GTID: 2428330623463694
Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years, with the rapid growth of computing power, deep learning networks have developed quickly and are now widely used in speech recognition, computer vision, natural language processing, and other fields. To extract more effective features, networks have grown ever deeper, with large model sizes and many parameters, and therefore require high-performance GPUs or similar devices for computational support. At the same time, with the rapid development of embedded and mobile devices such as drones, robots, and smartphones, the demand for deploying deep learning networks on these devices has become more intense. However, resources on these real-time platforms (storage, computing power, and battery capacity) are very limited. Accelerating and optimizing deep learning networks on resource-constrained platforms has therefore become an active research topic in both academia and industry. Many network acceleration and optimization algorithms have been proposed, but most target the image classification task; they are rarely combined with object detection, or applied as multiple compression methods within a single task. These gaps are the focus of this thesis.

This thesis first introduces the research background and development trends of deep learning network acceleration and optimization, and comprehensively surveys the main model compression algorithms. Then, targeting the embedded GPU platform NVIDIA Jetson TX2, two tasks are completed.

First, for the classic PASCAL VOC object detection benchmark, we compare popular deep-learning-based object detection algorithms under the Roofline model, and also deploy and verify them on the Jetson TX2. Considering the accuracy, efficiency, and model size of the different algorithms, we redesign an efficient object detection network using depthwise separable convolutions. Compared with YOLOv2, its accuracy drops by 5%, but detection speed increases by 150% and model size shrinks by 80%. Filter-level pruning is then applied to this new network to compress and accelerate it further, increasing detection speed by another 20% and reducing storage by about 55%.

Second, we take S^3FD as the base detection algorithm and optimize it for tiny-face detection under real surveillance cameras. One optimization implements S^3FD's missing network layers in CUDA and optimizes its computation graph, reducing the per-frame detection time on the Jetson TX2 from 0.69 s to 0.45 s. The other applies network quantization with the FP16 and INT8 data types; after calibration and related steps, the per-frame detection time drops further to 0.27 s (FP16) and 0.14 s (INT8). Finally, a tiny-face automatic detection system is built on the optimized network, achieving real-time performance on a PC (GTX 1080Ti).
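The Roofline comparison above can be pictured with a small calculation: attainable throughput is bounded by the minimum of the device's peak compute rate and its memory bandwidth times the kernel's arithmetic intensity. A minimal sketch, using illustrative numbers rather than measured Jetson TX2 specifications:

```python
def roofline(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable GFLOP/s under the Roofline model:
    min(compute roof, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# Illustrative device parameters (hypothetical, not TX2 measurements).
peak, bw = 1000.0, 50.0
for ai in (2.0, 10.0, 200.0):
    print(f"AI={ai:>6} FLOP/B -> {roofline(peak, bw, ai):.0f} GFLOP/s")
# Low-intensity kernels are memory-bound (50 * 2 = 100 GFLOP/s);
# high-intensity kernels hit the compute roof (1000 GFLOP/s).
```

Under this model, detection networks with low arithmetic intensity sit on the bandwidth-limited slope, which is why reducing memory traffic matters as much as reducing FLOPs on embedded GPUs.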
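The parameter savings behind the depthwise separable convolution redesign can be shown with simple arithmetic. This is a generic sketch of the standard cost comparison, not the thesis's actual layer configuration; the layer shape below is hypothetical:

```python
def standard_conv_params(k, c_in, c_out):
    """Parameter count of a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """A depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical example: a 3x3 layer with 256 input and 256 output channels.
std = standard_conv_params(3, 256, 256)        # 589,824 parameters
sep = depthwise_separable_params(3, 256, 256)  # 67,840 parameters
print(f"parameter reduction: {1 - sep / std:.1%}")  # ~88.5%
```

The same factorization also cuts multiply-accumulate operations by roughly the same ratio, which is what drives the reported speed-up and model-size reduction relative to YOLOv2.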
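Filter-level pruning, as used above to further compress the detection network, typically scores each filter by a saliency measure (commonly the L1 norm of its weights) and removes the lowest-scoring ones. A minimal sketch with toy weights; the scoring criterion here is the common L1-norm heuristic and is not necessarily the exact criterion used in the thesis:

```python
def prune_filters(filters, keep_ratio):
    """Keep the filters with the largest L1 norms.

    filters: list of flattened weight lists, one per filter.
    keep_ratio: fraction of filters to retain.
    Returns the indices of surviving filters, in original order.
    """
    scores = [sum(abs(w) for w in f) for f in filters]
    n_keep = max(1, int(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Four toy filters; the two with the smallest L1 norms are pruned away.
filters = [[0.9, -0.8], [0.01, 0.02], [-0.5, 0.6], [0.05, -0.03]]
print(prune_filters(filters, keep_ratio=0.5))  # [0, 2]
```

Because whole filters (and their output channels) are removed, the pruned network stays dense and needs no sparse-matrix support, which is why this style of pruning translates directly into speed and storage gains on a GPU.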
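The INT8 quantization step maps floating-point activations to 8-bit integers using a scale derived during calibration. The sketch below uses simple max-abs symmetric quantization to show the idea; TensorRT's actual calibration (as typically used on the Jetson TX2) is entropy/KL-divergence based, so this is an illustration, not the thesis's exact procedure:

```python
def calibrate_scale(samples):
    """Derive a symmetric INT8 scale from calibration data
    (max-abs calibration; a simplification of entropy calibration)."""
    return max(abs(x) for x in samples) / 127.0

def quantize(x, scale):
    """Map a float to the symmetric INT8 range [-127, 127]."""
    q = round(x / scale)
    return max(-127, min(127, q))

def dequantize(q, scale):
    """Recover an approximate float from the quantized value."""
    return q * scale

# Hypothetical calibration batch of activation values.
activations = [0.0, 0.5, -1.2, 3.1, -2.9]
scale = calibrate_scale(activations)
quantized = [quantize(x, scale) for x in activations]
print(quantized)
```

Running inference in INT8 (or FP16) lets the GPU use narrower arithmetic units and move a quarter (or half) of the data, which is the source of the 0.45 s -> 0.27 s -> 0.14 s per-frame improvements reported above.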
Keywords/Search Tags:Deep Learning, Object Detection, Embedded GPU, Network Compression