A Remote Retraining Framework For AIoT Devices With Computing Errors

Posted on: 2022-05-24

Degree: Master

Type: Thesis

Country: China

Candidate: M He

Full Text: PDF

GTID: 2518306560979459

Subject: IC Engineering

Abstract/Summary:

In recent years, AIoT devices, which combine neural network accelerators with IoT devices, have been widely used in many fields. However, smaller transistor sizes and lower supply voltages both increase the probability of soft errors in AIoT processors, which leads to frequent computational errors in neural network accelerators. If offline-trained neural networks are deployed directly to such accelerators, prediction accuracy drops sharply. Traditional fault-tolerant techniques (e.g., triple modular redundancy) suffer from energy inefficiency and performance loss. By investigating the characteristics of neural networks themselves, researchers have therefore found that retraining can improve the fault tolerance of network models. However, neural network training on general-purpose processors such as CPUs and GPUs cannot capture the computational errors of accelerators well. To address this issue, the work in this thesis is summarized as follows.

First, a remote retraining method for remote AIoT processors with computational errors is proposed. The remote AIoT processor with soft errors is placed in the training loop so that the application data on the server is exposed to the live computational errors. The retrained models are thus resilient to the soft errors. Furthermore, a set of client-server communication APIs is designed to facilitate the implementation of remote retraining on traditional training frameworks (e.g., PyTorch) and IoT software stacks.

Second, the remote retraining method is further optimized in terms of total training time and model accuracy. Specifically, a sparse incremental compression method is proposed to reduce the amount of data transmitted during retraining. Additionally, a local triple modular redundancy protection strategy based on a heuristic algorithm is presented to improve the accuracy of the retrained model, reducing the fault-tolerance cost while keeping the accuracy loss within a guaranteed bound.

Third, the remote retraining methods are implemented and evaluated on a set of typical neural network models. The experimental results show that remote retraining improves the top-5 accuracy of models by 1.79% to 15.04% compared with the offline-trained neural network models when the performance loss is 0% to 200%. Compared with direct remote retraining, the sparse incremental compression method reduces retraining time by 38% to 91% with negligible accuracy loss.
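
The abstract does not spell out how the sparse incremental compression works. As a purely illustrative sketch (an assumption, not the thesis's algorithm), one common way to cut transmitted data is to send only the parameters that changed noticeably since the last transfer, encoded as index-value pairs. The function names `encode_delta` and `apply_delta` and the `threshold` parameter are hypothetical.

```python
from typing import Dict, List


def encode_delta(prev: List[float], curr: List[float],
                 threshold: float = 1e-3) -> Dict[int, float]:
    """Hypothetical sender side: keep only entries that moved by more than
    `threshold` since the last state the receiver is known to hold."""
    if len(prev) != len(curr):
        raise ValueError("parameter vectors must have the same length")
    return {i: c for i, (p, c) in enumerate(zip(prev, curr))
            if abs(c - p) > threshold}


def apply_delta(prev: List[float], delta: Dict[int, float]) -> List[float]:
    """Hypothetical receiver side: overwrite only the transmitted entries."""
    out = list(prev)
    for i, value in delta.items():
        out[i] = value
    return out


if __name__ == "__main__":
    on_device = [0.50, -0.20, 0.10, 0.00]      # weights the device holds
    on_server = [0.50, -0.25, 0.1001, 0.30]    # weights after a training step
    delta = encode_delta(on_device, on_server)
    print(delta)                               # two of four entries are sent
    print(apply_delta(on_device, delta))
```

In this sketch the sender compares against the state the receiver actually holds (not its own previous weights), so changes smaller than the threshold accumulate across rounds and are sent once they become large enough, rather than being lost.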

Keywords/Search Tags:

AIoT devices, neural network accelerator, fault tolerance, collaborative training

Related items

1	The Training System Of Fault-tolerant Neural Network Model Based On CPU-FPGA
2	Research And Implementation Of Fault-tolerant Technologies For Quantized Neural Networks
3	Damage Analysis And Control Of Neural Network Accelerator Under Space Irradiation
4	Research On CNN Accelerator Soft Error Approximate Fault-tolerance Strategy
5	The Design Of Autonomous Opportunistic Protection Mechanism In Neural Network Accelerator Architecture
6	A High-reliability Deep Neural Network Accelerator With Hybrid Architecture
7	Design Of Neural Network Accelerator For Portable Applications
8	Research On Key Technologies Of Real-time OCR For AIoT Chips
9	Research And Implementation Of Fault Tolerance Enhancement Technology For Deep Neural Networks
10	Research On Optimization Of AIoT System With Branchy Neural Network