Neural network research is a prominent topic in computer science. Tracing the development of the field, it is easy to see that the computational load of neural network models keeps growing. With the spread of mobile computing devices, neural network inference increasingly takes place in edge computing scenarios; however, models with ever deeper layers still strain the power budget of embedded computing devices and the real-time requirements of tasks. Against this background, research on neural network accelerators is advancing rapidly. FPGA-based accelerators offer better energy efficiency than traditional GPUs and, thanks to the programmability of FPGAs, can be upgraded quickly to keep pace with rapidly evolving neural network models.

FPGA-based neural network accelerators fall into two main types. The first explores the hardware design space of an accelerator for a specific neural network model. The second is the instruction-driven general-purpose accelerator, whose hardware structure is relatively fixed: it relies on automatic exploration during neural network compilation to generate instructions, and executes those instructions to complete various neural network tasks. General-purpose accelerators are easier to deploy in practical applications and adapt more flexibly to changing tasks, so much research focuses on optimizing their performance. The most direct ways to improve performance are to design more efficient computing units and to explore better neural network compilation schemes.

This thesis presents a detailed analysis of the single-core performance and multi-core parallel execution performance of an FPGA-based neural network accelerator. Based on this analysis, it proposes two execution strategies for multi-core convolutional neural network accelerators. These strategies alleviate the memory access latency and low compute utilization that multi-core neural network accelerators face, without relying on advances in computing hardware: through reasonable task scheduling, task processing efficiency can be improved on existing software and hardware platforms. The findings also offer insights for improving computing hardware and the compilation tool chain.

The thesis takes the Xilinx DPU (Deep Learning Processing Unit) IP core as its research object. It analyzes single-core performance at the granularity of model nodes, summarizes the correlation between the computation and memory access of model nodes, and characterizes how compute utilization varies for different model nodes on differently configured DPU cores. It also analyzes the DPU's multi-core parallel execution, focusing on the effect of memory access latency. Using the large amount of data collected during performance analysis, a machine learning model is built to predict node execution time. The thesis then proposes a staggered execution strategy that spreads out peaks of high memory access bandwidth to mitigate memory access latency, and a heterogeneous multi-core segmented pipeline execution strategy to improve compute utilization.

Experimental results show that the execution time prediction model has an error of 3.64%, the staggered execution strategy brings a 5.79% performance improvement, and the heterogeneous multi-core segmented pipeline execution strategy brings a 13.5% performance improvement.
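To make the node-level prediction idea concrete, the following is a minimal sketch of a learned execution-time predictor. The features (MAC count, memory traffic, core parallelism), the roofline-style synthetic timing data, and the choice of ordinary least squares are all illustrative assumptions; the abstract does not specify the thesis's actual features or model.

```python
import numpy as np

# Hypothetical per-node features; real features would come from profiling.
rng = np.random.default_rng(42)
n = 300
macs = rng.uniform(1e6, 1e9, n)              # compute per node
traffic = rng.uniform(1e4, 1e8, n)           # bytes moved per node
parallel = rng.choice([512, 1024, 2048], n)  # assumed DPU ops/cycle config

# Synthetic ground-truth time (us): compute term + memory term + noise.
y = macs / parallel / 300.0 + traffic / 1.9e3 + rng.normal(0.0, 1.0, n)

# Feature matrix chosen so a linear model can recover both terms.
X = np.column_stack([macs / parallel, traffic, np.ones(n)])
coef, *_ = np.linalg.lstsq(X[:200], y[:200], rcond=None)  # train on 200 nodes

pred = X[200:] @ coef                                   # predict held-out nodes
err = float(np.mean(np.abs(pred - y[200:]) / y[200:])) * 100  # mean % error
```

In practice, richer features (layer type, tensor shapes, DMA burst sizes) and a nonlinear model would likely be needed to reach the low single-digit error the thesis reports.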
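The staggered execution strategy can be illustrated with a toy schedule: when several cores run the same model in lockstep, their memory-heavy phases collide, so offsetting their start times lowers the aggregate bandwidth peak. The per-timestep bandwidth profile below is invented for illustration; it is not the thesis's measured data.

```python
def aggregate_peak(profile, n_cores, offset):
    """Peak combined bandwidth when core c starts c*offset steps later."""
    length = len(profile) + (n_cores - 1) * offset
    total = [0.0] * length
    for c in range(n_cores):
        for t, bw in enumerate(profile):
            total[c * offset + t] += bw
    return max(total)

# Hypothetical bandwidth demand of one inference per timestep (GB/s):
# heavy at the start (weight/feature loads), light in compute-bound phases.
profile = [8, 8, 1, 1, 1, 6, 1, 1]

peak_sync = aggregate_peak(profile, n_cores=3, offset=0)  # all cores aligned
peak_stag = aggregate_peak(profile, n_cores=3, offset=2)  # staggered starts
```

Here `peak_sync` is 24 GB/s (three aligned bursts) while `peak_stag` drops to 15 GB/s, trading a small latency offset for less contention on shared memory bandwidth.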
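The segmented pipeline strategy can likewise be sketched: split a model's layers into contiguous segments, assign each segment to a different core, and stream inputs through, so throughput is limited by the slowest segment. The per-layer times and the two-core setup (the second core assumed 2x faster) are invented for illustration.

```python
def best_split(times, speedup_core2=2.0):
    """Try every split of `times` into two contiguous segments; the second
    segment runs on a core assumed `speedup_core2`x faster. Return the
    (split index, bottleneck stage time) minimizing the pipeline bottleneck."""
    best = None
    for k in range(1, len(times)):
        seg1 = sum(times[:k])                  # segment on core 1
        seg2 = sum(times[k:]) / speedup_core2  # segment on the faster core 2
        bottleneck = max(seg1, seg2)
        if best is None or bottleneck < best[1]:
            best = (k, bottleneck)
    return best

layer_ms = [3, 5, 2, 4, 6, 2]  # hypothetical per-layer times on core 1
split, bottleneck = best_split(layer_ms)
```

For these numbers the best split is after layer 2: core 1 takes 8 ms, core 2 takes 7 ms, so the pipeline accepts a new input every 8 ms instead of the 22 ms a single core 1 would need, which is the kind of utilization gain the strategy targets.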