In smart city traffic systems, vehicle flow is a fundamental piece of information. Detecting traffic flow accurately and quickly, and deploying that detection on an embedded hardware platform, is a crucial research topic in smart city traffic systems. In recent years, object detection algorithms based on convolutional neural networks (CNNs) have become dominant in many applications due to their superior accuracy compared to traditional schemes. The YOLO (You Only Look Once) series of algorithms has become popular for its combination of speed and precision. However, as CNN recognition accuracy continues to increase, so does computational complexity, which limits the deployment of these algorithms on hardware. Currently, CNNs are mostly deployed on Central Processing Units (CPUs) and Graphics Processing Units (GPUs). However, CPUs are composed mainly of memory units and control units and are therefore not well suited to accelerating deep learning models, while GPUs offer fast computation but high energy consumption, making them difficult to use in portable devices. Application-Specific Integrated Circuits (ASICs) can be designed to meet specific requirements, but they involve long design cycles, high initial investment, and a lack of reconfigurability. Field-Programmable Gate Arrays (FPGAs), on the other hand, combine the advantages of GPUs and ASICs: they are reconfigurable, high-performance, and low-power. In this paper, we use an FPGA to build a CNN hardware acceleration platform and accelerate and optimize the YOLOv2 (You Only Look Once Version 2) algorithm, achieving traffic flow detection on an embedded hardware platform. The main work of this paper is as follows:

(1) Optimizing the YOLOv2 network model. First, the YOLOv2 network model is quantized to half-precision floating-point format, reducing its storage requirements; after quantization, the model size is halved. Second, the convolutional layers and BN (Batch Normalization) layers are fused, improving the network's inference speed without reducing accuracy. Finally, the Confluence algorithm is used for post-processing, with a Manhattan-distance simplification to group all detection boxes belonging to the same target, reducing the complexity of the post-processing stage. Experimental comparison shows that the optimized YOLOv2 network model's inference time is reduced by 0.4 s.

(2) Designing a hardware accelerator that exploits the parallel computing capability of the FPGA. This paper adopts a top-down approach to modularize the network layers of YOLOv2, proposes basic half-precision floating-point operations from which convolutional layers, pooling layers, and the other network layers are constructed, and assembles the YOLOv2 network model from the bottom up. Because the on-chip memory of the FPGA cannot hold all of a layer's parameters and the logic resources are insufficient to perform all operations at once, this paper studies loop unrolling and loop tiling of the convolution based on the FPGA's hardware architecture, and proposes a pipelined operation based on ping-pong buffering that coordinates the input and output data-selection units by clock cycle to achieve high-speed data flow.

(3) Based on the XC7Z020 chip, a circuit board is designed for hardware acceleration of YOLOv2, building a vehicle flow detection system for detecting vehicle flow on the road. The system includes: a USB camera for capturing real-time road images; an OTG interface for powering the board; an SD card for storing the boot image and detection results; a network interface for remotely logging into the Linux system running on the PS (processing system) of the XC7Z020 chip and transmitting the camera images; and the XC7Z020 main control chip used for hardware acceleration of the YOLOv2 algorithm. The peripheral circuitry is built around this chip to complete the system design. Finally, comparative tests are performed on different devices. The experimental results show that the vehicle flow detection system proposed in this paper takes 2.01 seconds to detect a single frame, which is faster than the CPU's single-frame detection time and consumes far less power than the GPU, effectively achieving the system's functionality.
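The conv-BN fusion and half-precision quantization in item (1) can be sketched in a few lines: BN's per-channel scale and shift are folded into the convolution's weights and bias, so inference skips the BN layer entirely, and the fused weights are then cast to FP16 to halve storage. A minimal NumPy sketch (function names are illustrative, not from the thesis):

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.

    W: conv weights, shape (out_ch, in_ch, kh, kw)
    b: conv bias, shape (out_ch,)
    gamma, beta, mean, var: BN parameters, shape (out_ch,)
    """
    scale = gamma / np.sqrt(var + eps)           # per-channel BN scale
    W_fused = W * scale[:, None, None, None]     # scale each output channel
    b_fused = (b - mean) * scale + beta          # fold BN shift into the bias
    return W_fused, b_fused

def quantize_fp16(W):
    """Quantize weights to half precision, halving FP32 storage."""
    return W.astype(np.float16)
```

Because conv followed by BN is an affine map of an affine map, the fused layer is mathematically identical to the original pair, which is why accuracy is preserved.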
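The Confluence post-processing in item (1) replaces IoU-based suppression with a Manhattan-distance proximity between box corners, grouping detections of the same target. The sketch below is a simplified interpretation under stated assumptions: the span normalization, the threshold value, and the greedy highest-confidence clustering are illustrative choices, not the thesis's exact formulation.

```python
import numpy as np

def manhattan_proximity(a, b):
    """Manhattan distance between the corner points of two boxes
    (x1, y1, x2, y2), normalized by the enclosing span so the measure
    is roughly scale-invariant (normalization scheme is an assumption)."""
    coords = np.concatenate([a, b]).reshape(2, 4)
    x_span = coords[:, [0, 2]].max() - coords[:, [0, 2]].min()
    y_span = coords[:, [1, 3]].max() - coords[:, [1, 3]].min()
    dist = (abs(a[0] - b[0]) + abs(a[1] - b[1]) +
            abs(a[2] - b[2]) + abs(a[3] - b[3]))
    return dist / (x_span + y_span + 1e-9)

def confluence_filter(boxes, scores, thresh=0.8):
    """Keep one box per cluster of mutually proximal detections."""
    order = np.argsort(scores)[::-1]        # highest confidence first
    keep, suppressed = [], set()
    for i in order:
        if i in suppressed:
            continue
        keep.append(int(i))
        for j in order:
            if j != i and j not in suppressed and \
               manhattan_proximity(boxes[i], boxes[j]) < thresh:
                suppressed.add(int(j))      # same cluster: suppress
    return keep
```

Unlike IoU, the Manhattan measure needs only additions and subtractions, which is what makes it attractive for reducing post-processing complexity on embedded hardware.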
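The loop tiling in item (2) can be illustrated in software: the outer loops step over channel tiles sized to fit the on-chip buffers, while the inner loops represent the work the unrolled hardware performs; in the ping-pong scheme, one buffer is refilled from off-chip memory while the compute units consume the other. A simplified Python sketch (tile sizes `Tm`, `Tn` and the buffering comments are assumptions about the described architecture, not the thesis's exact parameters):

```python
import numpy as np

def conv2d_tiled(ifmap, weights, Tm=2, Tn=2):
    """Loop-blocked 2-D convolution, mirroring an FPGA tiling scheme.

    ifmap:   (N, H, W)    input feature maps
    weights: (M, N, K, K) kernels
    Tm, Tn:  tile sizes for output/input channels; on the FPGA the
             inner loops over a tile would be unrolled in hardware.
    """
    N, H, W = ifmap.shape
    M, _, K, _ = weights.shape
    oh, ow = H - K + 1, W - K + 1
    out = np.zeros((M, oh, ow), dtype=ifmap.dtype)

    # Outer loops step over tiles; each tile of weights and inputs is
    # staged into one on-chip buffer (ping) while the other (pong) is
    # being consumed, hiding memory latency behind computation.
    for m0 in range(0, M, Tm):                     # output-channel tiles
        for n0 in range(0, N, Tn):                 # input-channel tiles
            w_tile = weights[m0:m0+Tm, n0:n0+Tn]   # "load" into buffer
            i_tile = ifmap[n0:n0+Tn]
            # Inner loops: the per-tile work of the unrolled hardware.
            for m in range(w_tile.shape[0]):
                for n in range(i_tile.shape[0]):
                    for kh in range(K):
                        for kw in range(K):
                            out[m0+m] += (w_tile[m, n, kh, kw] *
                                          i_tile[n, kh:kh+oh, kw:kw+ow])
    return out
```

Blocking changes only the iteration order, not the arithmetic, so the tiled result matches a direct convolution exactly; the tile sizes trade on-chip buffer usage against the number of off-chip transfers.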