With the advent of the Internet, the volume of data generated by human beings has grown at an unprecedented rate, and the performance of data centers has also been growing rapidly. Together, these two trends have made it possible to train artificial neural networks with deeper structures. Deep neural networks can extract rich information from the big data generated on the Internet and are playing increasingly important roles in more and more fields. While deeper networks bring significant gains in accuracy in many applications, they also create an urgent demand for higher computation capacity, at the cost of higher power consumption. Among general-purpose processors, CPUs cannot meet the performance requirements of deep neural networks, while GPUs consume too much power. Moreover, because of the special computation and memory access characteristics of deep learning applications, it is difficult for general-purpose processors to achieve satisfactory execution efficiency. To address these issues, specialized accelerators have been proposed to achieve higher performance with lower energy consumption.

This paper focuses on key techniques for deep learning accelerator architecture and explores architectures that can efficiently support deep learning computation. In this research, we focus on the balance of the system and strive to improve both the efficiency of the computing resources and the bandwidth utilization of memory accesses, so that the data required by the computing resources can be delivered under limited memory bandwidth and the system remains efficient. The primary contributions and innovations of this paper are as follows:

1. The main computing unit of the deep learning accelerator, based on a chaining structured matrix multiplier, is proposed. This architecture can cover more than 90% of the workload of typical deep learning applications. The chaining structured matrix multiplier works inefficiently when there are too many edge sub-blocks. To solve this issue, we present a workload-sensitive dynamic scaling matrix multiplier structure and an optimized blocking strategy, which together improve the work efficiency of the chaining structured matrix multiplier.

2. Based on the computational model of convolutional neural networks, a stream-mapper unit is designed to map the computation of the convolutional layers onto the chaining structured matrix multiplier. This method parallelizes the computation in a fine-grained way and reduces the impact of the network structure on the work efficiency of the accelerator. Because the mapping task and the matrix multiplication can be overlapped, the time overhead of data rearrangement is eliminated. In addition, based on the memory access characteristics of convolutional neural networks, a prefetch method is proposed that rearranges the random memory accesses of convolutional layers into accesses with sequential addresses. This method reduces the total number of memory accesses and ensures a high memory bandwidth utilization rate, thus improving the overall performance of the accelerator.

3. A detailed analysis of a CPU/GPU-oriented OpenCL implementation is conducted, which identifies memory access as the bottleneck of that implementation. We explore the impact of the code on the memory access behavior of OpenCL on FPGAs and derive several programming rules that improve memory access efficiency. Based on the results of this analysis, we propose an optimized OpenCL implementation, which achieves a speedup of 4.76x over the original version.
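The stream-mapper in contribution 2 maps convolutional layers onto the matrix multiplier; in the accelerator this mapping is performed on the fly and overlapped with the multiplication itself. Its underlying principle, however, can be illustrated in software with the well-known im2col transformation, which unfolds receptive fields into matrix rows so that a convolution becomes a single matrix product. The sketch below is only an illustration of that principle (stride 1, no padding, a single input channel and filter); the function names are ours and do not appear in the paper.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a single-channel feature map into a matrix whose rows are
    the flattened kh-by-kw receptive fields (stride 1, no padding)."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_as_matmul(x, kernel):
    """A 2-D convolution expressed as one matrix-vector product:
    each output pixel is the dot product of a receptive-field row
    with the flattened kernel."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (im2col(x, kh, kw) @ kernel.ravel()).reshape(oh, ow)
```

With multiple filters, the flattened kernels form the columns of a weight matrix and the whole layer collapses into one matrix-matrix multiplication, which is why a matrix multiplier can cover the bulk of a convolutional network's workload.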