| In recent years, industry demand for deploying artificial intelligence and machine learning algorithms on embedded edge devices has grown rapidly. AI models based on deep neural networks (DNNs) have proven effective in many scenarios. However, it is difficult to meet throughput and latency requirements when model inference runs solely on the traditional CPUs and GPUs of embedded platforms, because of tight constraints on cost, power consumption, and memory bandwidth. Neural processing units (NPUs) are domain-specific hardware tailored to DNN inference. They process DNN operations efficiently at very low power and hardware cost, and are therefore well suited for integration into embedded SoCs to support edge AI applications. Compared with PC or server CPU/GPU platforms, which have mature software and hardware ecosystems, deploying DNN models to embedded edge NPU platforms poses many challenges. This thesis focuses on DNN model deployment on the edge and carries out its work on the domestic VeriSilicon Vivante NPU platforms. Our main contributions are as follows: 1. Deploying DNN models from the training side to the embedded NPU side often presents the following difficulties: a wide gap between intermediate representations, the need for many manual model conversions and operator adaptations, and low-level, cross-platform development and debugging. To solve these problems, this thesis proposes an end-to-end DNN model deployment system based on the Apache TVM machine learning compiler stack. The system comprises a compiler side and a runtime side, both adapted for the Vivante NPU so that they can harness its computational power efficiently. The system also exposes easy-to-use interfaces for model import, conversion, and remote evaluation, and it can be integrated with major deep learning frameworks such as PyTorch. 2. Embedded NPUs generally provide only low-bit integer (e.g., INT8) computational power, so trained DNN models with FP32 floating-point numeric representations must be quantized to INT8 models before they can be deployed to NPUs. Unfortunately, the quantization tools provided by NPU hardware vendors are usually cumbersome to use, slow, and prone to introducing significant accuracy degradation. This thesis proposes a model quantization tool based on FX, the computation graph transformation module of the PyTorch framework. The tool can use the model training infrastructure readily available in PyTorch, such as dataset loading and GPU acceleration, and it directly processes DNN models built in PyTorch, carrying out quantization and calibration automatically. The quantized model can then be passed to the proposed end-to-end deployment system for deployment and for evaluation of accuracy loss. The proposed system achieves end-to-end deployment of DNN models from the PyTorch training framework to embedded platforms equipped with Vivante NPUs. The correctness of the system is verified by experiments on various models commonly used in computer vision. We also compare the system with several existing model deployment and quantization frameworks. The experimental results show that the proposed system outperforms existing frameworks in terms of usability, accuracy, and inference speed. |
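
The FP32-to-INT8 step mentioned in the second contribution can be illustrated with a minimal sketch of affine quantization using min/max calibration. This is not the thesis's tool, which is built on PyTorch FX; it is a dependency-free illustration of the underlying arithmetic. The names `QParams`, `calibrate`, `quantize`, `dequantize`, and `max_abs_error` are hypothetical and chosen only for this example, and the min/max calibration rule is an assumption rather than the method used in the thesis.

```python
from dataclasses import dataclass

INT8_MIN, INT8_MAX = -128, 127


@dataclass
class QParams:
    """Affine quantization parameters (hypothetical helper type)."""
    scale: float
    zero_point: int


def calibrate(samples):
    """Derive scale and zero point from observed FP32 values (min/max rule)."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (INT8_MAX - INT8_MIN)
    if scale == 0.0:
        scale = 1.0  # all samples were zero; any positive scale works
    zero_point = round(INT8_MIN - lo / scale)
    zero_point = max(INT8_MIN, min(INT8_MAX, zero_point))
    return QParams(scale, zero_point)


def quantize(x, qp):
    """Map an FP32 value to INT8, clamping to the representable range."""
    q = round(x / qp.scale) + qp.zero_point
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q, qp):
    """Map an INT8 value back to an approximate FP32 value."""
    return (q - qp.zero_point) * qp.scale


def max_abs_error(samples, qp):
    """Worst-case round-trip error over the calibration samples."""
    return max(abs(x - dequantize(quantize(x, qp), qp)) for x in samples)


if __name__ == "__main__":
    data = [-1.0, -0.25, 0.0, 0.5, 3.0]
    qp = calibrate(data)
    print(qp, [quantize(x, qp) for x in data], max_abs_error(data, qp))
```

The round-trip error reported by `max_abs_error` is the kind of per-tensor accuracy loss that a calibration step tries to keep small; a real tool would collect the value ranges from activations observed while running a calibration dataset through the model.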