
A DNN Inference Acceleration Method For Resource-Constrained Processing-in-Memory Chips

Posted on: 2024-01-12    Degree: Master    Type: Thesis
Country: China    Candidate: X Gao    Full Text: PDF
GTID: 2568306923470504    Subject: Network and information security
Abstract/Summary:
Conventional computer systems use a von Neumann architecture with separate processor and memory. Because of limited communication bandwidth, data movement between the processor and memory becomes the performance bottleneck when running memory-intensive programs. Processing-in-Memory (PIM) accelerators perform computation in situ and eliminate data movement between memory and processor, making them an effective solution to this bottleneck. Among PIM devices, Resistive Random Access Memory (ReRAM), which offers access performance similar to DRAM and supports in-place matrix-vector multiplication, has been widely explored for accelerating Deep Neural Networks (DNNs).

Most existing studies of ReRAM-based DNN accelerators assume that all weights of a DNN can be programmed into the ReRAM crossbars at once. However, at current technology scaling, the ReRAM PIM chip capacity of an area-constrained embedded or edge device (typically dozens of Mbits) is substantially smaller than the weight size of current DNN models (e.g., 548 MB for VGGNet). It is therefore impractical to deploy all of the network's weights on ReRAM offline before inference. Instead, partial weights must be deployed one batch at a time, so completing the inference of a single input image requires multiple deployments with non-negligible programming latency. Some recent works have considered the limitation of PIM resources and proposed reusing the weights already programmed onto the chip to batch-process images, but no work has yet discussed how to exploit weight similarity to reduce weight-programming overhead.

The goal of this thesis is to design a programming-latency-aware DNN inference framework for resource-constrained ReRAM devices. The framework statically plans and schedules the weight blocks of the neural network according to the device's available resources, reducing weight-programming latency in the online phase and thus optimizing overall inference latency. Two challenges must be addressed to achieve this goal: 1) redesigning the mapping from DNN weights to each operation unit (OU) of the ReRAM chip to maximize the per-bit reuse benefit; 2) correctly activating the statically planned weight blocks at runtime to achieve accurate and efficient DNN inference.

To address the first challenge, we categorize the write characteristics of ReRAM crossbars discussed in the literature and conduct empirical studies to identify their impact on DNN weight-programming latency. We then model the programming latency of ReRAM and design a hierarchical optimization strategy for the proposed weight-programming-aware framework. To address the second challenge, we customize a corresponding OU scheduler to guarantee accurate and efficient DNN inference.

The proposed framework is evaluated on five standard DNN models and five natural language processing models. The results show that the static scheduling strategy proposed in this thesis achieves significant speedups, reducing overall latency by up to 52.91% compared with state-of-the-art techniques.
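The in-place matrix-vector multiplication that makes ReRAM attractive can be sketched in software. In a crossbar, weights are stored as cell conductances, inputs arrive as row voltages, and each column current is the dot product of the voltage vector with that column's conductances (Ohm's law plus Kirchhoff's current law). The sketch below is a minimal illustrative model of that analog computation, not the accelerator design from this thesis:

```python
def crossbar_mvm(conductances, voltages):
    """Analog MVM on a crossbar: each column current is the sum over rows
    of conductances[row][col] * voltages[row] (Ohm's + Kirchhoff's laws)."""
    cols = len(conductances[0])
    return [sum(g_row[j] * v for g_row, v in zip(conductances, voltages))
            for j in range(cols)]

# A 3x2 crossbar: two dot products computed in one "analog" step.
G = [[1.0, 0.5],
     [2.0, 0.0],
     [0.0, 1.5]]
v = [1.0, 2.0, 3.0]
print(crossbar_mvm(G, v))  # → [5.0, 5.0]
```

In hardware, all column currents appear simultaneously, which is why the crossbar performs the whole matrix-vector product in a single step rather than row by row.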
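The intuition behind exploiting weight similarity can also be illustrated with a toy latency model: if only the cells whose stored value differs from the incoming weight block are rewritten, programming cost scales with the Hamming distance between blocks rather than with block size. The per-cell write latency `T_WRITE` and the example blocks below are made up for illustration; they are not parameters or data from this thesis:

```python
T_WRITE = 100.0  # hypothetical per-cell write latency (arbitrary units)

def flatten(block):
    return [cell for row in block for cell in row]

def full_program_cost(block):
    """Naive deployment: every cell of the block is written."""
    return len(flatten(block)) * T_WRITE

def differential_program_cost(resident, incoming):
    """Similarity-aware deployment: rewrite only the differing cells."""
    diffs = sum(a != b for a, b in zip(flatten(resident), flatten(incoming)))
    return diffs * T_WRITE

resident = [[1, 0, 1], [0, 1, 1]]   # block already on the crossbar
incoming = [[1, 0, 0], [0, 1, 1]]   # next block to deploy (one cell differs)
print(full_program_cost(incoming))                    # → 600.0
print(differential_program_cost(resident, incoming))  # → 100.0
```

Under this toy model, a static scheduler would order weight-block deployments so that consecutive blocks mapped to the same crossbar region are as similar as possible, which is the kind of reuse benefit the proposed framework plans for offline.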
Keywords/Search Tags:PIM, ReRAM, resource-constrained, DNN accelerator