Artificial intelligence, a hot topic in computational science, is gradually moving from research into industrial application. The convolutional neural network (CNN) is a kind of feedforward neural network that has achieved remarkable results in deep learning in recent years. It has been successfully applied to image and video recognition, natural language processing, and targeted advertising, with error rates far better than those of traditional methods. Deep learning, and neural networks in particular, involves huge amounts of data and a complicated computation process. It places high demands on the data bandwidth and computing power of the hardware platform, and power-consumption requirements are especially stringent in mobile and portable applications. These constraints limit its practical deployment.

In this paper, a dedicated accelerator for the convolutional layers of convolutional neural networks is proposed. An internal dedicated data-mapping method is designed to enable parallel computing across different feature-map channels and different convolution kernels in the convolutional layer while reducing the complexity of the control logic. Two corresponding working modes are designed for the different data sizes of the image input layer and the intermediate convolutional layers. To balance storage bandwidth and computing speed, each convolutional layer is subdivided into finer-grained tasks executed in ping-pong fashion, so that the transfer time from external storage to on-chip RAM is covered by computation. Reusing the compute arrays in a reconfigurable manner reduces the required resources and makes full use of the hardware. The MAC (multiply-accumulate) unit can be reconfigured into either four 64-input multiply-accumulate trees or 64 four-input multiply-accumulate trees to adapt to different convolutional-layer data sizes.

Taking the five convolutional layers of the classic AlexNet as the test set, we ran a behavioral simulation of the entire accelerator at realistic IO speed. Thanks to the ping-pong operation, computation accounted for 87.8% of the total run time and data transfer for 12.2%; most of the transfer time was hidden by computation, so the IO bandwidth can match the computation speed of the entire system. The FPGA-based implementation reaches 57.7 GOPS at 160 MHz, with an average MAC-unit utilization of 70.5%. The design method of this CNN convolution accelerator is extensible and offers a useful reference for other, similar designs.
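The benefit of the ping-pong (double-buffered) schedule can be illustrated with a small timing model: while the compute array works on the tile in one buffer, the DMA fills the other, so transfer time is exposed only for the first tile. The tile counts and per-tile costs below are hypothetical illustrations, not the paper's actual parameters.

```python
# Illustrative timing model of ping-pong (double-buffered) execution.
# While the compute array processes one buffer, the next tile is loaded
# into the other, so load time is hidden whenever compute >= load.
# All numbers here are hypothetical, not taken from the paper.

def total_time_serial(n_tiles, t_load, t_compute):
    """Load then compute each tile, with no overlap."""
    return n_tiles * (t_load + t_compute)

def total_time_pingpong(n_tiles, t_load, t_compute):
    """Double buffering: only the first load is exposed; every later
    load overlaps the previous tile's computation."""
    if n_tiles == 0:
        return 0.0
    return t_load + (n_tiles - 1) * max(t_load, t_compute) + t_compute

if __name__ == "__main__":
    n, load, comp = 100, 2.0, 14.0  # hypothetical tile count and cycle costs
    serial = total_time_serial(n, load, comp)
    pingpong = total_time_pingpong(n, load, comp)
    print(f"serial:    {serial:.0f} cycles")
    print(f"ping-pong: {pingpong:.0f} cycles "
          f"(compute fraction {n * comp / pingpong:.1%})")
```

With these toy numbers the transfer cost almost vanishes; the 87.8%/12.2% split reported above reflects the same effect with the real AlexNet layer sizes and IO speed.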
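The reconfigurable MAC array can be pictured as a single pool of 256 multipliers (4 × 64 = 64 × 4) whose products are reduced by adder trees grouped one of two ways. The sketch below is a functional model under that assumption; the paper's actual RTL organization may differ, and the mode names are invented for illustration.

```python
# Functional model of a reconfigurable MAC array: 256 multipliers whose
# products are summed by adder trees grouped either as 4 trees of 64
# inputs (layers with many input channels) or 64 trees of 4 inputs
# (layers with few). A sketch of the idea, not the paper's actual RTL.

N_MACS = 256  # assumed total multiplier count (4 * 64 == 64 * 4)

def mac_array(weights, activations, mode):
    """Multiply element-wise, then reduce within each adder tree.

    mode "4x64": 4 trees of 64 inputs -> 4 partial sums
    mode "64x4": 64 trees of 4 inputs -> 64 partial sums
    """
    assert len(weights) == len(activations) == N_MACS
    products = [w * a for w, a in zip(weights, activations)]
    trees, fan_in = (4, 64) if mode == "4x64" else (64, 4)
    return [sum(products[t * fan_in:(t + 1) * fan_in]) for t in range(trees)]

if __name__ == "__main__":
    w = [1] * N_MACS
    x = list(range(N_MACS))
    print(mac_array(w, x, "4x64"))        # 4 wide partial sums
    print(len(mac_array(w, x, "64x4")))   # 64 narrow partial sums
```

Both modes keep every multiplier busy; only the reduction grouping changes, which is what lets one array serve layers with very different channel counts.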
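The reported throughput and utilization figures are mutually consistent if one assumes the array comprises 256 MAC units (the 4 × 64 configuration) and counts each MAC as two operations (multiply plus add). This is a sanity-check calculation under that assumption, not a figure from the paper:

```python
# Sanity check on the reported numbers, assuming 256 MAC units (4 * 64)
# and 2 operations (multiply + add) per MAC per cycle. Peak throughput
# at 160 MHz would then be 256 * 2 * 0.160 = 81.92 GOPS, and the
# reported 57.7 GOPS corresponds to roughly 70% utilization.
macs = 4 * 64
peak_gops = macs * 2 * 0.160       # 160 MHz expressed in GHz
utilization = 57.7 / peak_gops
print(f"peak {peak_gops:.2f} GOPS, utilization {utilization:.1%}")
```

The result (about 70.4%) agrees with the 70.5% average MAC utilization stated above, supporting the 256-MAC reading of the array size.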