Object detection is one of the fundamental tasks of computer vision. With the advent of deep learning, object detection algorithms based on convolutional neural networks (CNNs) have made great progress. As a detector that combines strong accuracy with real-time performance, the YOLOv2 network has a simple structure and relatively few layers, making it a good candidate for the industrial deployment of object detection. The parallelism, reconfigurability, and low power consumption of FPGAs make them a suitable platform for porting CNN-based object detection algorithms to embedded systems.

First, to deploy the YOLOv2 network more efficiently on the FPGA platform, this thesis optimizes its storage and computation. Using incremental network quantization, the single-precision floating-point weights are quantized to integer powers of 2 and encoded with four bits, compressing the YOLOv2 model by nearly a factor of 8. At the same time, the floating-point multiplications in the convolutional layers can be converted into shift operations, while detection performance on the Pascal VOC 2007 dataset remains consistent with that of the unquantized network.

Second, since YOLOv2 is built from convolutional layers, and so that our design also applies to other CNN-based object detection algorithms, we analyze and implement general-purpose convolutional neural network modules on the FPGA, and verify the correctness of the design through functional simulation of each module.

Finally, limited by the computational resources and memory bandwidth of the FPGA platform, the FPGA-based YOLOv2 network must tile each layer during forward inference. To make full use of the platform's computational resources and memory bandwidth, this thesis selects the best tiling parameters using the Roofline model. In addition, to reduce the number of accesses to external memory and shorten inference time, we use a double-buffering mechanism and introduce pipelining between layers. Experimental results show that our design achieves 3.2 frames per second on the xc7z035ffg676-2 development board at a working frequency of 100 MHz.