Guided by Moore's Law, the storage and computing capabilities of computer systems have advanced by leaps and bounds over the past few decades, providing fertile soil for the development of big data and artificial intelligence. In particular, the rapid development of neural networks in recent years, and their successful application in computer vision, natural language processing, social networks, knowledge graphs, and other fields, has driven the prosperity of artificial intelligence, making production more efficient and life more convenient. As neural networks are applied more widely, their models become increasingly complex, and the data they process expands from Euclidean image and video data to non-Euclidean graph structures. Meanwhile, as process technology approaches its physical limits, the growth of chip computing and storage capabilities is gradually slowing down. In the post-Moore era, the computing and storage demands of emerging neural network applications have outpaced the development of computing and memory chips. Emerging neural network applications therefore encounter increasingly severe computing and memory bottlenecks on traditional computer systems, resulting in low efficiency and limited deployment.

To address these computing and memory challenges and the deployment requirements of different application scenarios, this dissertation carries out research from three aspects: accelerating Euclidean three-dimensional Convolutional Neural Networks (3D CNNs) on the FPGA (Field Programmable Gate Array) platform, accelerating non-Euclidean Graph Neural Networks (GNNs) on the FPGA platform, and alleviating the memory bottleneck of GNN applications on large-scale datasets with the emerging Near-Memory Processing (NMP) architecture. This dissertation adopts a hardware-software co-design methodology that systematically studies the optimization of neural network algorithm execution, the innovative design of hardware acceleration architectures, and the coupled optimization of software and hardware systems, so as to improve the computing and memory efficiency of the system and achieve highly efficient neural network inference acceleration. The main works and innovations of this dissertation are as follows:

(1) This dissertation proposes an FPGA-based 3D CNN accelerator, 3D-NPU (Neural Processing Unit), which adopts the methodology of hardware-software co-design, explores the design space under different optimization goals, and achieves leading performance and Processing Element (PE) utilization. At the software level, this work proposes a coarse-grained data tiling method and different loop ordering strategies according to the memory access and computing features of 3D CNNs. At the hardware level, 3D-NPU adopts a scalable PE array and a reconfigurable on-chip cache design to realize the different loop ordering strategies with high flexibility, parallelism, and scalability. At the system level, this work proposes a design space exploration method that selects the optimal loop ordering strategy for each network layer under different optimization goals. Experimental results show that 3D-NPU reduces off-chip memory access by 84% and energy consumption by 55% compared with the baseline model, while achieving the highest performance and computational efficiency among prior works.
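To make the idea of coarse-grained tiling and configurable loop ordering concrete, the following Python sketch shows a tiled 3D-convolution loop nest. It is a minimal illustration, not the dissertation's 3D-NPU implementation; the tile sizes `T_oc` and `T_od` and the inner-loop ordering are hypothetical parameters standing in for the per-layer choices that a design space exploration would select.

```python
# Minimal sketch of coarse-grained tiling for a 3D convolution (illustrative only,
# not the 3D-NPU design). Tile sizes T_oc / T_od and the loop order are assumptions.
import numpy as np

def conv3d_tiled(x, w, T_oc=4, T_od=8):
    """x: [C_in, D, H, W], w: [C_out, C_in, KD, KH, KW]; stride 1, no padding."""
    C_in, D, H, W = x.shape
    C_out, _, KD, KH, KW = w.shape
    OD, OH, OW = D - KD + 1, H - KH + 1, W - KW + 1
    y = np.zeros((C_out, OD, OH, OW), dtype=x.dtype)

    # Outer loops walk over coarse-grained tiles; on an accelerator, each tile's
    # inputs and weights would be fetched once into on-chip buffers.
    for oc0 in range(0, C_out, T_oc):
        for od0 in range(0, OD, T_od):
            oc1 = min(oc0 + T_oc, C_out)
            od1 = min(od0 + T_od, OD)
            # Inner loops: one possible ordering (output-stationary within a tile).
            for oc in range(oc0, oc1):
                for od in range(od0, od1):
                    for oh in range(OH):
                        for ow in range(OW):
                            patch = x[:, od:od+KD, oh:oh+KH, ow:ow+KW]
                            y[oc, od, oh, ow] = np.sum(patch * w[oc])
    return y

# Tiny usage example
x = np.random.rand(2, 8, 8, 8).astype(np.float32)
w = np.random.rand(4, 2, 3, 3, 3).astype(np.float32)
print(conv3d_tiled(x, w).shape)  # (4, 6, 6, 6)
```

On an accelerator, each outer-loop tile corresponds to one batch of data resident in the on-chip buffers, which is why the choice of tile size and loop order directly determines off-chip traffic and PE utilization.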
(2) This dissertation proposes an FPGA-based GNN accelerator, FP-GNN, which adopts the methodology of hardware-software co-design to achieve highly flexible and efficient GNN acceleration. At the software level, this work quantitatively analyzes the impact of the execution order of GNN algorithms on performance and proposes an adaptive hierarchical graph partitioning method for large-scale GNN datasets, improving the efficiency of the memory subsystem and eliminating the graph repartitioning overhead between layers. At the hardware level, FP-GNN adopts a unified computing architecture to achieve a flexible execution order and efficient on-chip resource utilization. At the system level, this work optimizes the efficiency of the storage and computing systems from the perspectives of graph data access optimization, load balancing, sparsity elimination, and mixed execution. Experimental results demonstrate that, averaged over various GNN models and datasets, FP-GNN delivers 24.9 times the performance and 138 times the energy efficiency of a GPU (Graphics Processing Unit), and achieves state-of-the-art performance efficiency and energy efficiency compared with prior works.

(3) This dissertation proposes a DIMM (Dual In-line Memory Module) based NMP accelerator, G-NMP, which realizes hardware-software co-optimization for GNNs to achieve practical and efficient NMP acceleration. At the software level, this work reduces the design complexity of the NMP architecture by extracting fine-grained basic operators from various GNN algorithms. At the hardware level, G-NMP adopts a unified computing module to achieve flexible data flow and utilizes rank-level parallelism to improve parallel memory access bandwidth. At the system level, this work designs an instruction set architecture for G-NMP, G-ISA, to achieve efficient deployment of GNN algorithms. Meanwhile, this work proposes an adaptive data allocation strategy and an accumulation mode optimization method to improve the memory access and computing efficiency of G-NMP. In addition, this work proposes a communication strategy between the CPU and the G-NMP accelerator to achieve low-overhead memory ownership transition. Experimental results show that, under the same memory configuration, the average performance of the G-NMP accelerator is 1.73 times that of state-of-the-art works, and its energy efficiency is 31.6 times and 1.78 times that of the CPU and GPU, respectively.
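As a rough illustration of the operator-level decomposition mentioned above, the sketch below splits one GNN layer into two fine-grained basic operators: an irregular, memory-bound neighbor aggregation and a regular, compute-bound feature combination. The function names `aggregate` and `combine` and the sum/ReLU choices are hypothetical; they are not G-NMP's actual operator set or G-ISA instructions.

```python
# Hypothetical decomposition of one GNN layer into fine-grained basic operators
# (illustrative only; not G-NMP's operator set).
import numpy as np

def aggregate(features, edges):
    """Sum-aggregate neighbor features. edges: list of (src, dst) pairs."""
    agg = np.zeros_like(features)
    for src, dst in edges:              # irregular, memory-bound traversal
        agg[dst] += features[src]
    return agg

def combine(agg, weight):
    """Dense feature transformation: regular, compute-bound matrix multiply."""
    return np.maximum(agg @ weight, 0.0)  # linear layer + ReLU

# Tiny usage example: 4 nodes, 8-dim features, one GNN layer
feats = np.random.rand(4, 8).astype(np.float32)
w = np.random.rand(8, 16).astype(np.float32)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(combine(aggregate(feats, edges), w).shape)  # (4, 16)
```

Intuitively, the aggregation step is the one that benefits most from parallel near-memory access, while the dense combination step maps naturally onto a unified computing module.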