With the continued refinement of the ARM architecture and the massive popularity of intelligent mobile terminals, demand for deep learning applications in everyday life keeps growing. These devices typically have very limited memory and computing resources, while mobile applications impose strict requirements on real-time performance, security, and accuracy, so it is difficult to deploy a convolutional neural network on a mobile terminal and still run forward inference quickly and accurately. This difficulty has motivated a number of lightweight convolutional neural network models with few parameters and high performance. Addressing the problems of deploying a lightweight object detection network on ARM mobile terminals, this paper improves the network model, optimizes its parallel computation, and proposes an effective inference acceleration scheme. The main investigations are as follows.

First, the object detection model YOLOv3 is studied and improved. The one-stage YOLOv3 algorithm serves as the base network for multi-scale object feature detection, and the theoretical speedup of the depthwise separable convolution used in the lightweight network MobileNet V1 is analyzed. To reduce the parameter count, MobileNet_YOLOv3 is proposed, in which MobileNet V1 replaces Darknet53 as the backbone of the original YOLOv3 algorithm.

Second, forward inference of the model is accelerated and the model is deployed with quantization. A parallel convolution pipeline is implemented for ARM processors: ARM NEON single-instruction multiple-data (SIMD) instructions are applied to general matrix multiplication (GEMM), which is computed quickly by tiling and rearranging the matrices; the convolutional layers are accelerated with methods such as "Im2col+GEMM" and Winograd; and, to enable parallel operation, the convolution kernels and input feature maps are rearranged into matrices and stored in the NC4HW4 data layout.

Third, quantization schemes for the inference stage are compared, and the linear quantization procedure is improved. The analysis shows that BF16 quantization yields a modest speedup with no significant loss of accuracy, whereas INT8 linear quantization, with symmetric quantization of the weights and asymmetric quantization of the activations, yields a large speedup but a noticeable drop in accuracy. To improve the quantization scheme, the Pearson correlation coefficient is used to measure the point-wise discrepancy between the output feature matrices before and after quantization. To raise inference speed further, activations are quantized to 7 bits and weights to 6 bits.

Finally, the proposed optimizations are deployed and verified on a hardware platform. A parallel convolution experiment with MobileNet V1 on the RK3399 demonstrates the advantage of the NC4HW4 data arrangement. The improved MobileNet_YOLOv3 model is then evaluated experimentally, running forward inference with the FP32, BF16, and INT8 data types. The improved quantization method is likewise validated by experiment.
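For reference, the theoretical saving of depthwise separable convolution analyzed for MobileNet V1 follows from comparing its cost with that of a standard convolution. In the notation of the MobileNet V1 paper, with a D_K x D_K kernel, M input channels, N output channels, and a D_F x D_F output feature map, the cost ratio is

\[
\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^{2}}
\]

so with the usual 3x3 kernels the multiply-accumulate count drops by roughly a factor of eight to nine.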
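To make the "Im2col+GEMM" step concrete, the following is a minimal C sketch of the standard im2col transform, which unfolds an input feature map so that convolution reduces to a single matrix multiplication. The function name and the stride/padding parameters are illustrative assumptions, not the thesis's actual code.

#include <stddef.h>

/* Unfold an input feature map (C x H x W, row-major) into a matrix of
 * shape (C*K*K) x (OH*OW), so that convolution with an (N x C*K*K)
 * weight matrix becomes one GEMM. */
static void im2col(const float *input, int channels, int height, int width,
                   int ksize, int stride, int pad, float *col) {
    int out_h = (height + 2 * pad - ksize) / stride + 1;
    int out_w = (width  + 2 * pad - ksize) / stride + 1;
    for (int c = 0; c < channels; ++c)
        for (int kh = 0; kh < ksize; ++kh)
            for (int kw = 0; kw < ksize; ++kw) {
                int row = (c * ksize + kh) * ksize + kw;
                for (int oh = 0; oh < out_h; ++oh)
                    for (int ow = 0; ow < out_w; ++ow) {
                        int ih = oh * stride + kh - pad;
                        int iw = ow * stride + kw - pad;
                        float v = (ih >= 0 && ih < height &&
                                   iw >= 0 && iw < width)
                                  ? input[(c * height + ih) * width + iw]
                                  : 0.0f;  /* zero padding */
                        col[row * (out_h * out_w) + oh * out_w + ow] = v;
                    }
            }
}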
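The benefit of the NC4HW4 layout for NEON can be sketched with a 1x1 convolution kernel in C intrinsics: because channels are packed in groups of four, one 128-bit load moves four channels of a pixel at once, and the inner loop becomes a chain of vector multiply-accumulates. The layout details and weight packing below are assumptions for illustration, not the thesis's implementation.

#include <arm_neon.h>

/* src: in_c4 channel blocks, each [H*W][4] floats (NC4HW4 layout)
 * wgt: for each input block, 4x4 weights mapping 4 in-channels to
 *      4 out-channels, stored input-channel-major
 * dst: one output channel block of [H*W][4] floats */
static void conv1x1_nc4hw4(const float *src, const float *wgt,
                           float *dst, int area, int in_c4) {
    for (int p = 0; p < area; ++p) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int c4 = 0; c4 < in_c4; ++c4) {
            /* four input channels of pixel p in a single load */
            float32x4_t in = vld1q_f32(src + (c4 * area + p) * 4);
            const float *w = wgt + c4 * 16;
            acc = vmlaq_n_f32(acc, vld1q_f32(w +  0), vgetq_lane_f32(in, 0));
            acc = vmlaq_n_f32(acc, vld1q_f32(w +  4), vgetq_lane_f32(in, 1));
            acc = vmlaq_n_f32(acc, vld1q_f32(w +  8), vgetq_lane_f32(in, 2));
            acc = vmlaq_n_f32(acc, vld1q_f32(w + 12), vgetq_lane_f32(in, 3));
        }
        vst1q_f32(dst + p * 4, acc);  /* four output channels at once */
    }
}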
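The two linear mappings used in the INT8 scheme can be sketched as follows, under the common per-tensor formulation: weights are quantized symmetrically (zero point fixed at 0), activations asymmetrically (the observed [min, max] range is mapped onto [-128, 127] with a zero point). The helper names are hypothetical, not the thesis's code.

#include <math.h>
#include <stdint.h>

/* Symmetric weight quantization: scale from the largest magnitude. */
static int8_t quant_symmetric(float w, float max_abs) {
    float scale = max_abs / 127.0f;
    long q = lroundf(w / scale);
    if (q >  127) q =  127;
    if (q < -127) q = -127;
    return (int8_t)q;
}

/* Asymmetric activation quantization: min maps to -128, and the zero
 * point makes real 0 exactly representable. */
static int8_t quant_asymmetric(float x, float min, float max,
                               float *scale, int32_t *zero_point) {
    *scale = (max - min) / 255.0f;
    *zero_point = (int32_t)lroundf(-128.0f - min / *scale);
    long q = lroundf(x / *scale) + *zero_point;
    if (q >  127) q =  127;
    if (q < -128) q = -128;
    return (int8_t)q;
}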
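Finally, the calibration criterion based on the Pearson correlation coefficient amounts to comparing a layer's float output x with its dequantized output y over n points; a quantization scale that keeps r close to 1 preserves the structure of the feature map. A minimal sketch of this measure, as an assumed helper rather than the thesis's code:

#include <math.h>
#include <stddef.h>

/* Pearson correlation coefficient between two feature-map buffers. */
static double pearson_r(const float *x, const float *y, size_t n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < n; ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += (double)x[i] * x[i];
        syy += (double)y[i] * y[i];
        sxy += (double)x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}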