Research On Systolic Array Based Hardware Accelerator For Convolutional Neural Networks

Posted on: 2023-07-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Xu
GTID: 1528307169476884
Subject: Electronic Science and Technology
Abstract/Summary:
Convolutional neural networks (CNNs) play an important role in deep learning research. Because CNNs have high computational complexity and large data storage requirements, hardware accelerators must be used for network training or inference to achieve higher performance, and the systolic array architecture has gradually become a popular choice for CNN accelerators. However, as CNNs have developed in recent years, various CNN models have appeared. When processing these new models, especially compact convolutional neural networks (compact CNNs), the performance of the systolic array degrades severely. Experimental results show that the small-scale convolution and depthwise convolution in compact CNNs reduce the utilization of the computing units, which seriously affects the performance and efficiency of the systolic array.

To solve these problems, this dissertation first designs an efficient Systolic Array Performance Simulator (SAPS). The simulator is based on an analysis of the systolic array workflow and focuses on modeling design parameters such as dataflow and array size. For a given network model, it quickly produces the performance results of the systolic array. The simulator also helps to analyze why systolic arrays are inefficient when processing compact CNNs, and to explore the design space to find an optimal systolic array design for a specific workload. Because SAPS is built on an efficient analysis model, it shortens simulation time from hours or days to seconds or minutes compared with current mainstream systolic array simulators; with the help of iterative algorithms, it can also explore the design space automatically and complete design iterations efficiently.

To address the problem of computing small-scale convolution, this dissertation proposes the Configurable Combinatorial Systolic Array (CCSA). The small-scale convolution problem also reflects the scalability problem of the systolic array. For this reason, we first use the SAPS analysis model to look for a solution, and decide to improve the small-scale convolution performance of the systolic array by dividing the PE array. After discussing the advantages and disadvantages of scaling-up and scaling-out, the two mainstream methods of scaling systolic arrays, we propose the CCSA. Compared with scaling-up, CCSA solves the small-scale convolution problem through the flexible combination of multiple small arrays; compared with scaling-out, CCSA avoids the additional data traffic and improves the energy efficiency of the systolic array. Compared with the traditional systolic array design, CCSA increases computing-unit utilization in small-scale convolution by 16% and improves relative performance by 24%; it also reduces data traffic by 45% while matching the performance of scaling-out.

However, further analysis of the small-scale convolution problem shows that the systolic array performs better on small-scale convolution only when the PE array can be segmented asymmetrically. In addition, the CCSA design does not solve the problem of depthwise convolution. To this end, this dissertation proposes the Configurable Multi-directional Systolic Array (CMSA). CMSA first uses a multi-directional transmission structure to realize asymmetric segmentation of the PE array. We then adopt a software-hardware co-design method and propose a more efficient dataflow for depthwise convolution, the Column Stationary (CS) dataflow. Compared with traditional dataflows, it fully exploits the computational parallelism in depthwise convolution. CMSA supports the CS dataflow through a new computing-unit structure. Experiments show that in small-scale convolution, CMSA always achieves the best performance compared with CCSA and traditional systolic arrays, and increases computing-unit utilization by up to 67%; compared with traditional systolic array designs, it increases computing-unit utilization by about 60 percentage points and improves relative performance by up to 12.5 times. At the same time, CMSA still uses the original structure of the systolic array, which maintains the simplicity and efficiency of the design. As a result, at an acceptable area cost, it further reduces energy consumption by up to 20% compared with the traditional design.

After further analysis of how the systolic array processes depthwise convolution, this dissertation finds that the hardware resources of the systolic array are sufficient; what the design needs is an optimized dataflow that fully exploits the parallelism and data reuse in depthwise convolution. We therefore propose the Configurable Heterogeneous Systolic Array (CHSA). Using the software-hardware co-design method, we first propose the Output Stationary with Single-channel (OS-S) dataflow, which suits the systolic array structure and accelerates depthwise convolution. Through a heterogeneous PE design, CHSA supports both the traditional dataflow and the OS-S dataflow. In addition, CHSA can use the CCSA design to achieve better performance in small-scale convolution and in scaled designs. Experiments show that, compared with traditional systolic arrays, the CHSA+CCSA design improves depthwise convolution performance by up to 11.2 times and overall CNN model performance by 3.1 times. CHSA also keeps the area overhead essentially unchanged compared with the traditional systolic array, and saves about 20% of the energy consumption.
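The utilization problem described in the abstract can be illustrated with a toy calculation. The sketch below is not the SAPS analysis model. It assumes a simple weight-stationary mapping in which the filter matrix of a layer (k_h * k_w * c_in rows by c_out columns) is tiled onto a rows x cols PE array, and it reports the fraction of PE slots that hold a useful weight. The function name `pe_utilization`, the array sizes, and the layer shapes are hypothetical choices made only for illustration.

```python
from math import ceil


def pe_utilization(rows, cols, k_h, k_w, c_in, c_out):
    """Fraction of PE slots holding a useful weight (toy model, an assumption).

    The layer's filters are viewed as a matrix with k_h * k_w * c_in rows
    (the reduction dimension) and c_out columns, tiled onto a rows x cols
    array. Partially filled tiles leave PEs idle.
    """
    reduce_dim = k_h * k_w * c_in
    row_tiles = ceil(reduce_dim / rows)   # tiles along the reduction dimension
    col_tiles = ceil(c_out / cols)        # tiles along the output channels
    allocated = row_tiles * rows * col_tiles * cols
    return (reduce_dim * c_out) / allocated


if __name__ == "__main__":
    # Hypothetical layer shapes, chosen only to show the trend.
    print("standard 3x3, 64->64 on 32x32:", pe_utilization(32, 32, 3, 3, 64, 64))
    print("small 1x1, 16->16 on 32x32:   ", pe_utilization(32, 32, 1, 1, 16, 16))
    print("small 1x1, 16->16 on 16x16:   ", pe_utilization(16, 16, 1, 1, 16, 16))
    # Depthwise: assumed here to be mapped one channel at a time (c_in = c_out = 1).
    print("depthwise 3x3 on 32x32:       ", pe_utilization(32, 32, 3, 3, 1, 1))
```

Under these assumptions, the small 1x1 layer fills a quarter of a 32 x 32 array but fills a 16 x 16 sub-array completely, which is the intuition behind dividing the PE array. A depthwise 3 x 3 filter mapped one channel at a time occupies 9 of 1024 PEs, which suggests why a dedicated dataflow is needed for depthwise convolution.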
Keywords/Search Tags: hardware accelerator, systolic array, dataflow, convolutional neural network, small-scale convolution, depthwise convolution