In recent years, the artificial intelligence industry has developed rapidly, especially deep neural networks. However, the excellent performance of neural networks comes at the cost of complex architectures and heavy computation. Simply pursuing network performance without considering the computing cost greatly limits the scenarios in which a model can be deployed. Data quantization is a mainstream research direction for neural network deployment: the network is compressed and accelerated by reducing the bit width of the data representation. It requires no change to the original network architecture and is hardware friendly. However, existing methods still have problems. First, current quantization-aware training methods need access to the original data set and a predetermined quantization bit width, and these two prerequisites may not be met in practice. Second, existing quantization algorithms still leave considerable redundancy in network compression and acceleration, and inference efficiency can be further improved through algorithm-architecture co-design. To solve the above problems, this paper presents the following innovative research results:

· For the two scenarios of quantization-aware training without training data and with an uncertain quantization bit width, this work proposes data-free quantization-aware fine-tuning (DFQF) and dynamic precision onion quantization (DPOQ), respectively. DFQF first trains a generator network against the outputs of the full-precision model, then generates a fake training set through the generator and distills the knowledge of the full-precision model into the quantized model (a minimal distillation sketch appears after this list). A network trained with DPOQ can adapt to multiple quantization bit widths: the higher-precision subnetworks reuse the parameters and intermediate results of the lower-precision ones, so users can change the quantization precision according to the computing power of the deployment device without retraining the model (see the weight-slicing sketch below).

· This paper proposes the structured term pruning (STP) algorithm and hardware architecture, which further explores quantization redundancy at the fine-grained bit level. On the algorithm side, training is guided so that network parameters are more likely to converge to values with fewer non-zero terms, improving bit sparsity, and the computation is structured into groups. On the hardware side, bit-serial multiply-accumulators skip invalid multiply-by-zero operations to improve inference efficiency, and multiple bit-serial multiply-accumulators form an interleaved processing element that achieves throughput similar to a parallel multiply-accumulator (the zero-skipping datapath is sketched after this list). On ResNet18 with the ImageNet dataset, the computing energy efficiency is improved by 2.35×.

· To change the quantization precision dynamically according to the input image, this paper proposes two dynamic-precision algorithm and hardware architectures of different granularity: sample-wise dynamic precision (SWDP) and structured dynamic quantization (SDP). SWDP adopts the DPOQ network structure and its multi-precision, easy-switching property: the output confidence of the low-precision network serves as a criterion for the difficulty of an input sample, and an appropriate quantization precision is allocated to each sample (a confidence-gating sketch follows this list). SDP divides the quantized data
into high and low parts and selects a fixed number of important high parts for computation through a non-zero Top-k method, reducing the amount of computation in a structured way (see the Top-k sketch at the end). In terms of hardware, SDP implements a processing element that supports dynamic selection of operands and designs a fully pipelined Top-k hardware engine, adapted to the skewed output of the systolic array, to identify the importance of feature maps. SDP achieves a 29% performance improvement and 51% energy saving compared with existing dynamic-precision accelerators.
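The following is a minimal PyTorch sketch of the data-free distillation loop behind DFQF. The generator architecture, the confidence-based generator loss, the temperature, and all other hyperparameters here are illustrative assumptions, not the exact method of the thesis.

```python
# Hypothetical sketch of DFQF-style data-free distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Maps random noise to fake images used in place of real data."""
    def __init__(self, z_dim=100, img_shape=(3, 32, 32)):
        super().__init__()
        self.img_shape = img_shape
        out = img_shape[0] * img_shape[1] * img_shape[2]
        self.net = nn.Sequential(
            nn.Linear(z_dim, 512), nn.ReLU(),
            nn.Linear(512, out), nn.Tanh(),
        )
    def forward(self, z):
        return self.net(z).view(-1, *self.img_shape)

def distill_step(gen, teacher, student, g_opt, s_opt, z_dim=100, T=4.0):
    # 1) Generator step: produce samples on which the frozen
    #    full-precision teacher is confident (its own argmax as label).
    fake = gen(torch.randn(64, z_dim))
    t_logits = teacher(fake)
    g_loss = F.cross_entropy(t_logits, t_logits.argmax(dim=1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # 2) Student step: distill the teacher's soft predictions into
    #    the quantized student on a detached fake batch.
    fake = gen(torch.randn(64, z_dim)).detach()
    with torch.no_grad():
        t_soft = F.softmax(teacher(fake) / T, dim=1)
    s_loss = F.kl_div(F.log_softmax(student(fake) / T, dim=1),
                      t_soft, reduction="batchmean") * (T * T)
    s_opt.zero_grad(); s_loss.backward(); s_opt.step()
    return g_loss.item(), s_loss.item()
```

Here `teacher` is the frozen full-precision model and `student` is the quantized model; no sample from the original data set is ever touched.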
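A sketch of the "onion" parameter reuse behind DPOQ: a lower-precision subnetwork keeps only the most-significant bits of the shared integer weights, so no separate copy is stored per bit width. The 8-bit base width and the round-to-nearest shift are assumptions for illustration.

```python
# Hypothetical DPOQ-style weight slicing: low-bit views of one tensor.
import torch

def slice_weights(w8: torch.Tensor, bits: int) -> torch.Tensor:
    """Reuse 8-bit integer weights (stored in an int32 tensor) at a
    lower precision by dropping least-significant bits."""
    shift = 8 - bits
    if shift == 0:
        return w8
    half = 1 << (shift - 1)          # half an LSB: round to nearest
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return torch.clamp((w8 + half) >> shift, lo, hi)

w8 = torch.randint(-128, 127, (4, 4), dtype=torch.int32)
w4 = slice_weights(w8, bits=4)   # 4-bit view reuses the same storage
w2 = slice_weights(w8, bits=2)   # 2-bit view, again without retraining
```

Because every precision is a prefix of the same bit representation, switching precision at deployment time is a shift, not a new model.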
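The next sketch models, in pure Python, the term-level sparsity that STP exploits: each weight is decomposed into signed power-of-two terms, and a bit-serial multiply-accumulator spends a cycle only on non-zero terms. This is a software model of the datapath, not the interleaved hardware processing element itself.

```python
# Software model of a zero-skipping bit-serial MAC (STP-style).

def to_terms(w: int):
    """Decompose an integer weight into (sign, exponent) terms,
    one per non-zero bit of |w|."""
    sign = -1 if w < 0 else 1
    w = abs(w)
    return [(sign, k) for k in range(w.bit_length()) if (w >> k) & 1]

def bit_serial_mac(weights, activations):
    """Accumulate sum(w * a) one shift-add per cycle, skipping
    weights' zero bits entirely."""
    acc, cycles = 0, 0
    for w, a in zip(weights, activations):
        for sign, k in to_terms(w):    # zero terms are never issued
            acc += sign * (a << k)     # shift-add instead of multiply
            cycles += 1
    return acc, cycles

acc, cycles = bit_serial_mac([5, 0, -2, 8], [3, 7, 1, 2])
# acc == 29 in 4 term-cycles, versus 4 weights x 4 bits = 16 cycles
# for a naive bit-serial MAC that cannot skip zero terms.
```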
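SWDP's sample-wise allocation can be summarized as a per-sample early exit over the precisions of one DPOQ network. The `model(x, bits=...)` interface, the precision schedule, and the confidence thresholds below are hypothetical.

```python
# Hypothetical SWDP-style inference: escalate precision per sample
# only when the low-precision output confidence is too low.
import torch
import torch.nn.functional as F

def swdp_infer(model, x, schedule=((2, 0.9), (4, 0.8)), max_bits=8):
    n = x.shape[0]
    pred = torch.full((n,), -1, dtype=torch.long)
    todo = torch.ones(n, dtype=torch.bool)        # samples still undecided
    for bits, thr in schedule:                    # cheapest precision first
        if not todo.any():
            break
        probs = F.softmax(model(x[todo], bits=bits), dim=1)
        conf, p = probs.max(dim=1)
        done = conf >= thr                        # "easy" at this precision
        idx = todo.nonzero(as_tuple=True)[0]
        pred[idx[done]] = p[done]
        todo[idx[done]] = False
    if todo.any():                                # hard samples: full precision
        pred[todo] = model(x[todo], bits=max_bits).argmax(dim=1)
    return pred
```

Easy inputs thus pay only the low-precision cost, while hard inputs retain full-precision accuracy.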
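Finally, a sketch of SDP's structured selection: quantized values are split into high and low halves, and only a fixed number k of entries with non-zero high halves are kept per group, so the work per group is constant. The group size, k, and the 4/4-bit split are assumptions; the thesis implements the selection as a pipelined hardware Top-k engine rather than software.

```python
# Hypothetical SDP-style non-zero Top-k selection over groups.
import torch

def sdp_select(x: torch.Tensor, k: int = 4) -> torch.Tensor:
    """x: (groups, group_size) int32 tensor of 8-bit quantized values.
    Returns a mask keeping the k largest-magnitude high halves per group."""
    high = x >> 4                          # high 4-bit part = importance proxy
    score = high.abs()
    topk = torch.topk(score, k, dim=1).indices
    mask = torch.zeros_like(x, dtype=torch.bool)
    mask.scatter_(1, topk, True)           # structured: exactly k slots/group
    return mask & (score > 0)              # drop zero high parts entirely

x = torch.randint(-128, 127, (2, 8), dtype=torch.int32)
mask = sdp_select(x, k=4)  # at most 4 of every 8 values reach the MACs
```

Fixing k per group is what keeps the sparsity structured, so the processing element sees a constant, schedulable amount of work.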