Research On Compression Algorithm Of Neural Network Based On Combined Ternary Quantization

Posted on:2021-10-26

Degree:Master

Type:Thesis

Country:China

Candidate:P L Yang

GTID:2518306122462644

Subject:Mechanical engineering

Abstract/Summary:

In recent years, deep learning has driven a new wave of artificial intelligence technology and has been applied successfully in many fields. However, deep learning models such as convolutional neural networks have very large parameter counts and high computational costs, so they rely heavily on high-performance computing devices such as GPUs or even GPU clusters. This severely limits their deployment in edge computing scenarios with limited hardware resources, which has made model compression of deep neural networks an active research topic. Quantization is one of the most effective compression methods: replacing the original floating-point parameters with low-precision values reduces both the storage needed for model parameters and the energy consumed by computation. The acceleration is most significant when the weights and activations of the network are quantized to 1 or 2 bits. However, the fewer the quantization bits, the larger the resulting computation error, and this error accumulates layer by layer during forward computation and backpropagation, inevitably causing a serious loss of accuracy. It is therefore important to adopt a quantization strategy that balances algorithm versatility, compression capability, and accuracy degradation. This thesis proposes a quantization algorithm based on ternary quantization of weights and fixed-point quantization of activations. The main contributions are as follows:

(1) Combined ternary quantization of weights is proposed, which quantizes the weights of a convolutional layer as the sum of the products of multiple scaling factors and ternary weights. Compared with direct quantization, binary or ternary weights with a single scaling factor can reduce quantization error. Combined ternary quantization adds a small number of parameters and a small amount of computation, but it overcomes the limitation of a single quantized weight and fits the original weights better.

(2) Based on 2-bit fixed-point quantization, box plots are used to characterize the data distribution of the activation tensor and to clip outliers. The study finds that, under direct fixed-point quantization of activations, a few outliers with large values can cause a large amount of information to be lost after quantization. Clipping them makes the data distribution before quantization more uniform and concentrated and keeps the quantization error within a normal range.

(3) Integrating the quantization strategies for weights and activations, a quantization architecture for convolutional models is proposed. The complete training process of the architecture is derived from the backpropagation algorithm, and the relevant details of the training algorithm are described. During inference, most floating-point operations can be converted to fixed-point integer operations, which processors execute more efficiently. On image recognition tasks, comparing prediction accuracy against the original floating-point model and other quantized models shows that the algorithm effectively reduces accuracy degradation while preserving versatility and compression capability.
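
The ideas in (1) and (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the scaling factors and ternary weights are obtained, so the sketch assumes a greedy fit to the residual with a threshold of 0.7 times the mean absolute value. It also assumes that "box plot" means the usual quartile fences at 1.5 times the interquartile range, and that activations are non-negative before 2-bit quantization. All function names and parameters are hypothetical.

```python
from statistics import quantiles


def ternarize(values, ratio=0.7):
    """Fit values ~ alpha * t with t in {-1, 0, +1}.

    The threshold rule (ratio * mean |v|) is an assumed heuristic,
    not taken from the abstract.
    """
    delta = ratio * sum(abs(v) for v in values) / len(values)
    t = [1 if v > delta else -1 if v < -delta else 0 for v in values]
    kept = [abs(v) for v, s in zip(values, t) if s != 0]
    # For a fixed t, the mean of the kept magnitudes is the least-squares alpha.
    alpha = sum(kept) / len(kept) if kept else 0.0
    return alpha, t


def combined_ternary(weights, terms=2):
    """Approximate weights as sum_i alpha_i * t_i by ternarizing the residual."""
    residual = list(weights)
    parts = []
    for _ in range(terms):
        alpha, t = ternarize(residual)
        parts.append((alpha, t))
        residual = [r - alpha * s for r, s in zip(residual, t)]
    return parts


def reconstruct(parts):
    """Rebuild the approximate weights from (alpha, ternary vector) pairs."""
    n = len(parts[0][1])
    return [sum(alpha * t[i] for alpha, t in parts) for i in range(n)]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def boxplot_clip(acts, k=1.5):
    """Clip values outside the box-plot fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(acts, n=4)
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(a, lower), upper) for a in acts]


def quantize_2bit(acts):
    """Uniform unsigned 2-bit quantization: integer codes 0..3 and a step size.

    Assumes non-negative activations (for example, after ReLU).
    """
    top = max(acts)
    step = top / 3 if top > 0 else 1.0
    codes = [min(3, max(0, round(a / step))) for a in acts]
    return codes, step


if __name__ == "__main__":
    w = [0.9, -0.8, 0.1, 0.05, -0.4, 0.35, 0.0, -0.95]
    for terms in (1, 2, 3):
        approx = reconstruct(combined_ternary(w, terms))
        print(terms, "term(s): MSE =", mse(w, approx))

    acts = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 50.0]
    print("direct codes: ", quantize_2bit(acts)[0])
    print("clipped codes:", quantize_2bit(boxplot_clip(acts))[0])
```

In this toy run, each additional ternary term lowers the reconstruction error, at the cost of one more scaling factor and one more ternary tensor per layer. The single large activation forces the direct 2-bit quantizer to map every other value to code 0, whereas clipping first lets the remaining values spread across more of the four levels.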

Keywords/Search Tags:

Neural network, Model compression, Fixed-point quantization, Combined ternary quantization

Related items

1	Research On Key Problems Of Fixed-point For Convolutional Neural Network
2	Fixed-Point Inference Of Neural Image Compression
3	Study Of Mixed Precision Quantization Of Convolution Neural Network
4	Study Of Low Bit-width Quantization Of Deep Convolutional Neural Network
5	Deep Neural Network Compression Method Based On Product Quantization
6	Research On Application Of Neural Network Compression And Acceleration Based On Quantization
7	Mixed-precision Quantization Methods For Convolutional Neural Network Compression
8	Research On Accelerating Algorithm Of Neural Network Based On Quantization
9	Research And Application Of Neural Network Quantization Aware Training Methods
10	Context Modeling And Vector - Scalar Quantizer Of The Ecg Signal Compression