| In recent years, deep learning has driven a new wave of artificial intelligence technology and has been applied successfully in many fields. However, deep learning models, typified by convolutional neural networks, have enormous numbers of parameters and high computational costs, and rely heavily on high-performance computing devices such as GPUs or even GPU clusters. This severely limits the deployment of deep learning models in edge computing scenarios with limited hardware resources, so compression of deep neural networks has become an active research topic. Quantization is one of the most effective compression methods: replacing the original floating-point parameters with low-precision values reduces both the storage required for model parameters and the energy consumed by computation. The acceleration is most significant when the weights and activations of the network are quantized to 1 bit or 2 bits. However, the fewer the quantization bits, the larger the resulting calculation error, and this error accumulates layer by layer during forward computation and backpropagation, inevitably causing a serious loss of accuracy. It is therefore important to adopt a quantization strategy that balances algorithm versatility, compression capability, and accuracy degradation. This work proposes a quantization algorithm based on ternary quantization of weights and fixed-point quantization of activations. The main contributions are as follows: (1) A combined ternary quantization of weights is proposed, which uses the sum of the products of multiple scaling factors and ternary weights to quantize the weights of a convolutional layer. Compared with direct quantization, binary or ternary weights with a single scaling factor can reduce the quantization error. Although combined ternary quantization slightly increases the number of parameters and the amount of computation, it overcomes the limitation of a single quantized weight and fits the original weights better. (2) Based on 2-bit fixed-point quantization, box plots are used to characterize the data distribution of the activation tensor and to clip its outliers. Under direct fixed-point quantization of activations, a few outliers with large values may be present, which results in a large amount of information being lost after quantization. The proposed method makes the data distribution before quantization more uniform and concentrated and keeps the quantization error within a normal range. (3) By integrating the quantization strategies for weights and activations, a quantization architecture for convolutional models is proposed. The complete training process of the quantization architecture is given according to the backpropagation algorithm, and the relevant details of the training algorithm are described. During model inference, most floating-point operations can be converted into fixed-point integer operations, which processors execute more efficiently. On image recognition tasks, comparison of prediction accuracy against the original floating-point model and other quantized models shows that the proposed algorithm effectively reduces accuracy degradation while preserving versatility and compression capability. |
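
The two quantization ideas described in the abstract, combined ternary weights and box-plot clipping of activations, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the greedy residual fitting procedure, the `0.7 * mean(|w|)` threshold, the `1.5 * IQR` whisker rule, and the assumption that activations are non-negative (for example after ReLU) are all illustrative choices that the abstract does not specify.

```python
import numpy as np


def ternarize(w):
    """Single-scale ternary fit: w ~ alpha * t with t in {-1, 0, +1}.

    The 0.7 * mean(|w|) threshold is an assumed heuristic, not taken
    from the abstract.
    """
    delta = 0.7 * np.mean(np.abs(w))
    t = np.where(np.abs(w) > delta, np.sign(w), 0.0)
    mask = t != 0
    # Least-squares optimal scale for the chosen ternary pattern.
    alpha = float(np.abs(w[mask]).mean()) if mask.any() else 0.0
    return alpha, t


def combined_ternarize(w, num_terms=2):
    """Approximate w by sum_i alpha_i * t_i, fitting each term to the residual.

    Greedy residual fitting is an assumption; the abstract only states that
    a sum of scaled ternary weights is used.
    """
    residual = np.asarray(w, dtype=np.float64).copy()
    terms = []
    for _ in range(num_terms):
        alpha, t = ternarize(residual)
        terms.append((alpha, t))
        residual = residual - alpha * t
    return terms


def reconstruct(terms):
    """Rebuild the approximate weights from (alpha, ternary tensor) pairs."""
    return sum(alpha * t for alpha, t in terms)


def boxplot_clip(x, k=1.5):
    """Clip values outside the box-plot whiskers [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return np.clip(x, q1 - k * iqr, q3 + k * iqr)


def quantize_activation(x, bits=2):
    """Uniform quantization of non-negative activations to 2**bits levels.

    Returns integer codes and the scale, so that x ~ codes * scale.
    Non-negative input is assumed (negative values are set to zero).
    """
    x = np.maximum(np.asarray(x, dtype=np.float64), 0.0)
    levels = 2 ** bits - 1
    hi = x.max()
    if hi == 0:
        return np.zeros(x.shape, dtype=np.int64), 1.0
    scale = hi / levels
    codes = np.round(x / scale).astype(np.int64)
    return codes, scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(16, 3, 3, 3))  # hypothetical conv kernel
    for n in (1, 2, 3):
        err = np.mean((w - reconstruct(combined_ternarize(w, n))) ** 2)
        print(f"{n} ternary term(s): MSE = {err:.4f}")

    act = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 50.0])  # one large outlier
    print("codes without clipping:", quantize_activation(act)[0])
    print("codes with clipping:   ", quantize_activation(boxplot_clip(act))[0])
```

In this sketch each additional ternary term is fitted to the residual left by the previous ones, so the reconstruction error cannot increase as terms are added, at the cost of one extra scaling factor and ternary tensor per term. For the activations, clipping before quantization keeps a single outlier from stretching the 2-bit range so far that all ordinary values collapse onto the same code.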