With the development of artificial intelligence and the proliferation of edge devices, the demand for deploying AI algorithms at the edge is growing. Deploying neural networks on processors with a homogeneous architecture is constrained by power consumption and bandwidth, making it difficult to adapt to diverse neural network structures and varied task requirements. Processors based on a heterogeneous architecture, by contrast, can perform massive parallel computation at low power, which makes them well suited to running neural networks on edge devices. This paper therefore uses Xilinx's ZedBoard as a heterogeneous acceleration platform and combines the Winograd fast convolution algorithm with a dynamic-parameter fixed-point quantization strategy to design a heterogeneous accelerator for deploying convolutional neural networks at the edge. The main work of this paper is as follows:

Firstly, this paper summarizes the research background, domestic and foreign development status, basic models, and calculation principles of convolutional neural networks (CNNs) and their heterogeneous accelerators. For the development platform and tools, it selects the HLS (High-Level Synthesis) design flow and the ZedBoard heterogeneous development platform, based on the computational characteristics of CNNs and the performance advantages of heterogeneous platforms. For the accelerator design, it targets the LeNet-5 model and analyzes the parallelism inherent in the network's forward-propagation mechanism and its internal operations. Taking a typical accelerator architecture as an example, it examines the data-transfer process of the heterogeneous accelerator, the allocation of resources between software and hardware, and efficient parallel optimization strategies, laying a theoretical foundation for the accelerator architecture and the embedded-system
development in later chapters.

Secondly, this paper designs a CNN heterogeneous accelerator IP (intellectual property) core and an on-chip embedded system that achieve strong performance at low power. The accelerator IP design rests on three elements: dynamic-parameter fixed-point quantization, a fast convolution algorithm, and a parallelism optimization strategy. Dynamic-parameter fixed-point quantization adjusts the quantization bit width by statistically analyzing the actual distribution range of the parameters, preserving accuracy while reducing storage and computation and improving the stability and robustness of the CNN model. The fast convolution algorithm speeds up convolution operations and reduces system latency; this paper adopts a fast convolution algorithm based on the Winograd transformation, which greatly reduces the number of multiplications in convolution and lowers the computational complexity. The parallelism optimization strategy combines the parallel structure of CNNs with the parallel computing capability of the FPGA, designing optimization schemes suited to each module, such as pipelined loop unrolling, dataflow optimization, and array partitioning for the convolution, pooling, and fully connected operations. For the on-chip embedded system, this paper uses the SDK toolchain to develop the low-level drivers of the CNN acceleration system, completes the full mapping from the CNN model to the ZYNQ accelerator, and exploits the strengths of the heterogeneous platform to allocate software and hardware resources, reducing CPU load and memory usage while shortening CNN inference time.

Finally, this paper takes LeNet-5 as the target algorithm and the ZedBoard as the design platform. The system is tested on the MNIST dataset, and the accelerator is evaluated comprehensively
from four aspects: system accuracy, resource consumption, inference speed, and system power consumption. Experiments show that the accelerator achieves a computing performance of 5.25 GOPS, completes one forward inference in only 5.14×10⁻⁴ s, and consumes 2.6 W. It is 106 times more energy-efficient than the general-purpose AMD Ryzen 5 5600G CPU and 188 times more energy-efficient than the on-board ARM processor, demonstrating its excellent performance.
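The figures above can be cross-checked with simple arithmetic (energy efficiency is taken here as throughput per watt; only the numbers stated in the text are used):

```python
# Consistency arithmetic on the reported figures (values from the text).
gops = 5.25          # reported throughput, GOPS
power_w = 2.6        # reported power consumption, W
latency_s = 5.14e-4  # reported single-inference latency, s

efficiency = gops / power_w                  # ~2.02 GOPS per watt
ops_per_inference = gops * 1e9 * latency_s   # ~2.70e6 operations per forward pass
```

The 106x and 188x ratios then compare this roughly 2 GOPS/W figure against the corresponding GOPS/W of the CPU and the ARM core, whose absolute values are not restated in the abstract.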
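To make the Winograd idea concrete, the 1-D F(2,3) case, which is the building block of the 2-D transform commonly used for 3×3 convolutions, can be sketched as a plain-Python reference model. This is an illustrative sketch only, not the thesis's HLS implementation; function names are chosen for this example:

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap convolution from four inputs,
    using 4 multiplications instead of the direct method's 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    """Direct sliding-window computation, for comparison."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]   # input tile
g = [0.5, 1.0, -1.0]       # filter taps
print(winograd_f23(d, g))  # [-0.5, 0.0], matches direct_conv(d, g)
```

Note that the filter-side terms depend only on g, so for a fixed trained kernel they can be precomputed, which is what makes the multiplication savings attractive in hardware.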
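The dynamic-parameter fixed-point quantization described earlier can likewise be sketched as a reference model: the fractional bit width is chosen per tensor from the observed value range. This is a simplified illustration under assumed conventions (signed values, per-tensor scaling); the thesis's exact statistical procedure may differ:

```python
import math

def quantize_dynamic_fixed_point(values, bit_width=8):
    """Pick the fractional bit count from the actual value range,
    then round to signed fixed point with saturation."""
    max_abs = max(abs(v) for v in values)
    # Integer bits needed to cover the range (one bit reserved for sign).
    int_bits = max(0, math.ceil(math.log2(max_abs))) if max_abs > 0 else 0
    frac_bits = bit_width - 1 - int_bits
    scale = 2 ** frac_bits
    qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    q = [min(max(round(v * scale), qmin), qmax) for v in values]
    return q, frac_bits

def dequantize(q, frac_bits):
    """Recover approximate real values from the fixed-point codes."""
    return [v / (2 ** frac_bits) for v in q]
```

For example, the layer parameters `[0.5, -1.25, 0.75]` at 8 bits get 6 fractional bits (codes `[32, -80, 48]`), whereas a layer with a wider range would automatically trade fractional precision for integer range.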