Object detection is widely used in civil and military fields such as artificial intelligence, medical research, and national defense security. Compared with traditional algorithms, deep learning-based object detection algorithms, which use Convolutional Neural Networks (CNNs) to extract features and perform image classification and localization, achieve far better performance. However, CNNs often have variable layer parameters and structures and require large numbers of parameters and computations, which makes such algorithms difficult to deploy in resource-constrained embedded applications that demand high speed and low power. Compared with GPU and ASIC embedded platforms, FPGAs offer the advantages of low cost, reconfigurability, and high energy efficiency. This paper implements hardware acceleration of a deep learning-based object detection algorithm on an FPGA platform. The main research work is as follows:

1. Based on the ZYNQ 7100 heterogeneous hardware platform and the hardware-acceleration analysis of CNN-based object detection algorithms, the research tasks were divided following a software/hardware co-design approach, and the overall architecture was designed under the given design requirements.

2. Building on the overall architecture, the Roofline model is used to evaluate the theoretical performance of typical deep learning-based object detection algorithms when implemented on the ZYNQ 7100 platform. Taking factors such as detection accuracy and model complexity into account, Mobilenet-SSD is selected as the object detection algorithm best suited for deployment on the platform. Its detection principle and network structure are then analyzed to clarify the software/hardware task allocation scheme for the algorithm.

3. Targeting both the standard convolution and the depthwise (DW) convolution of the depthwise separable convolutions in Mobilenet-SSD, the paper designs a CNN accelerator in the programmable logic (PL) using hardware optimization techniques such as parallelism, pipelining, and double buffering, and uses the Roofline model, together with the block-convolution idea, to find the best tiling and parallel-computation coefficients for the accelerator. To ensure that Mobilenet-SSD loses no accuracy, the accelerator processes 32-bit floating-point data. The accelerator is then invoked through DMA data transfers, and the functions of the processing system (PS) part are implemented.

4. Finally, functional verification and performance tests were carried out on the GVI CXZ7100 development board. The results show that the design is correct and fully meets the requirements. With an on-chip power consumption of only 8.527 W, the CNN accelerator reaches a peak computing performance of 26.67 GOP/s, and it processes the network about 110 times faster than a pure-software implementation without the accelerator. Compared with other related research, the CNN accelerator in this paper has advantages in both computational performance and detection throughput.
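The Roofline model referred to in items 2 and 3 bounds attainable throughput by whichever is lower: the compute roof or the memory-traffic roof. A minimal sketch of that evaluation, where the DDR bandwidth figure and the ops-per-byte ratios are illustrative assumptions (only the 26.67 GOP/s peak comes from the reported results):

```python
# Sketch of the Roofline bound used to evaluate candidate networks
# and tiling coefficients. Platform numbers other than the 26.67
# GOP/s peak are assumed for illustration, not measured values.

def roofline(peak_gops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Attainable throughput in GOP/s: min(compute roof, CTC * bandwidth),
    where CTC (computation-to-communication ratio) is ops per byte moved."""
    return min(peak_gops, ops_per_byte * bandwidth_gbs)

# Assumed 4.2 GB/s effective DDR bandwidth on the PS-PL interface.
low_ctc = roofline(26.67, 4.2, 2.0)    # 8.4  -> memory-bound layer
high_ctc = roofline(26.67, 4.2, 10.0)  # 26.67 -> compute-bound layer
print(low_ctc, high_ctc)
```

A layer (or tiling choice) with a low computation-to-communication ratio sits under the bandwidth roof, which is why the tiling and parallelism coefficients are chosen to raise data reuse until the accelerator operates near its compute roof.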