| With the maturity and popularization of technologies such as the Internet of Things and cloud computing, tens of billions of terminal devices generate huge amounts of data, which are uploaded to cloud centers with powerful computing capabilities for processing. Massive data and powerful computing capabilities have driven the rapid development of deep learning algorithms. Owing to their excellent generalization, deep learning algorithms are widely used in computer vision, natural language processing, autonomous driving, robot decision-making, and other fields, helping make smart living and working a reality. Deep learning algorithms therefore have broad prospects for development. In traditional intelligent systems based on the "cloud-end" architecture, deep learning applications are deployed in cloud centers with powerful computing capabilities and good scalability. The massive data generated on terminal devices must be transmitted to the cloud center over the network to support training and inference of the deep learning model. However, cloud-centric computing suffers from several problems, such as insufficient real-time performance, limited bandwidth, and privacy leakage. Edge computing emerged to alleviate these problems. By undertaking downstream deep learning application services and upstream terminal data processing tasks, edge computing executes deep learning models close to where the data is generated, which significantly mitigates the problems exposed by cloud-centric computing. However, edge devices have limited hardware resources, so deploying deep learning models that require large amounts of computing and storage resources faces serious challenges. As a computation-intensive and memory-intensive task, a deep learning algorithm frequently moves data between the processing units and the memory units at runtime. As the amount of data increases, these frequent data transfers place higher demands on the memory bandwidth of edge devices, leading to the "memory wall" 
problem. Moreover, the data movement consumes a lot of energy, causing the "energy wall" problem. The emergence of Processing-in-Memory (PIM) technology based on resistive random access memory (ReRAM) provides an opportunity to solve these problems. However, ReRAM-based PIM technology is still immature. The special properties of the underlying hardware are difficult for the upper-layer deep learning model to perceive, which results in many problems (such as low operating efficiency and many invalid computations) and makes it difficult to fully exploit the technology's advantages in performance and energy consumption. Therefore, we rethink the ReRAM-based PIM system for edge-side deep learning algorithms and conduct research at three levels: deep learning model compression, deep learning execution engine design, and deep learning computing architecture design. (1) To address the shortage of the computing and storage resources required to deploy deep learning models on edge devices, we propose a cooperative weight/activation pruning method based on sensitivity analysis. The method simultaneously considers three dimensions: model accuracy, compression ratio, and hardware efficiency. It removes redundant weights and activations through a cluster-based deep neural network (DNN) weight pattern pruning algorithm and a sparsity-row-based DNN activation pruning algorithm, thereby compressing the DNN model and pruning invalid computations. Experiments show that the cooperative weight/activation pruning method reduces storage space by 55% and computation by 63% on average. (2) To address the inefficiency of the ReRAM-based crossbar hardware structure in supporting DNN compression algorithms, we propose an ReRAM DNN engine for the pruning-quantization algorithm. First, we propose a fine-grained patch-aware pruning-quantization joint algorithm to compress DNN models. To support the algorithm efficiently, we further propose a configurable single-bit ReRAM DNN execution 
engine based on mixed operation units. Experiments show that the execution engine enables models compressed by the fine-grained patch-aware pruning-quantization joint algorithm to achieve higher performance, lower energy consumption, and a smaller storage footprint. (3) To address the invalid computations in ReRAM-based DNN PIM architectures, we propose an ReRAM execution engine that reuses fine-grained DNN weight pattern repetitions. The execution engine mainly comprises a weight-pattern-repetition-aware DNN computing engine and a mapping table between weight pattern repetitions and operation units, which together achieve storage space compression and computation reuse. Experiments show that, compared with an advanced ReRAM DNN execution engine, the execution engine on average improves performance by 1.73×, reduces energy consumption by 56.53%, and saves 52.95% of ReRAM space. (4) To address the lack of a high-performance, low-power PIM architecture on the edge side, we design a highly parallel ReRAM-based DNN PIM architecture, which supports the ReRAM execution engine that reuses fine-grained DNN weight pattern repetitions. In addition, we design a six-stage pipeline for the processing engine in the architecture and realize asynchronous parallel execution across different layers of the DNN model. Experiments show that the highly parallel ReRAM-based DNN PIM architecture achieves up to a 2.74× performance improvement, a 72% reduction in energy consumption, and a 70% saving in ReRAM space. In summary, from the perspective of system architecture, we make a series of key optimizations in the design of a dedicated ReRAM-based DNN PIM acceleration architecture using hardware-software co-design. These optimizations enable deep learning algorithms to run with high performance and low power consumption on the edge side, with the aim of promoting the development and application of edge intelligence. |
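
The idea of reusing repeated weight patterns, mentioned in contribution (3), can be illustrated with a minimal software sketch. This is not the proposed execution engine; it only assumes that each weight row is cut into fixed-length patterns, that identical patterns at the same position are stored once, and that a mapping table records which stored pattern each row uses. The function names `build_pattern_table` and `matvec_with_reuse`, the pattern length `k`, and the example weights are all hypothetical.

```python
def build_pattern_table(weights, k):
    """Store each distinct length-k weight pattern once per chunk position.

    Assumes every row has the same length and that it is a multiple of k.
    Returns (unique, mapping): unique[j] lists the distinct patterns at
    chunk position j, and mapping[i][j] is the index into unique[j] of the
    pattern used by row i.
    """
    n_chunks = len(weights[0]) // k
    unique = [[] for _ in range(n_chunks)]
    index = [{} for _ in range(n_chunks)]  # pattern -> id, per chunk position
    mapping = []
    for row in weights:
        ids = []
        for j in range(n_chunks):
            pattern = tuple(row[j * k:(j + 1) * k])
            if pattern not in index[j]:
                index[j][pattern] = len(unique[j])
                unique[j].append(pattern)
            ids.append(index[j][pattern])
        mapping.append(ids)
    return unique, mapping


def matvec_with_reuse(unique, mapping, x, k):
    """Matrix-vector product that computes each distinct pattern only once."""
    # One partial dot product per stored pattern and chunk position.
    partial = [
        [sum(w * v for w, v in zip(pattern, x[j * k:(j + 1) * k]))
         for pattern in patterns]
        for j, patterns in enumerate(unique)
    ]
    # Rows that share a pattern read the same partial result via the table.
    return [sum(partial[j][pid] for j, pid in enumerate(ids))
            for ids in mapping]


# Hypothetical 3x4 weight matrix with repeated length-2 patterns.
weights = [[1, 0, 2, 2],
           [1, 0, 0, 1],
           [0, 1, 2, 2]]
x = [1, 2, 3, 4]
unique, mapping = build_pattern_table(weights, k=2)
result = matvec_with_reuse(unique, mapping, x, k=2)
stored_patterns = sum(len(patterns) for patterns in unique)
```

In this toy case four patterns are stored instead of six, and the result matches the ordinary matrix-vector product. How repetitions are detected and mapped onto ReRAM operation units in hardware is not covered by this sketch.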