
A Remote Retraining Framework For AIoT Devices With Computing Errors

Posted on: 2022-05-24, Degree: Master, Type: Thesis
Country: China, Candidate: M He, Full Text: PDF
GTID: 2518306560979459, Subject: IC Engineering
Abstract/Summary:
In recent years, AIoT devices, which combine neural network accelerators with IoT devices, have been widely used in many fields. However, smaller transistor sizes and lower supply voltages increase the probability of soft errors in AIoT processors, leading to frequent computational errors in neural network accelerators. If offline-trained neural networks are deployed directly to such accelerators, prediction accuracy drops sharply. Traditional fault-tolerant techniques (e.g., triple modular redundancy) suffer from energy inefficiency and performance loss. Researchers have therefore studied the characteristics of neural networks themselves and found that retraining can improve the fault tolerance of network models. However, neural network training on general-purpose processors such as CPUs and GPUs cannot capture the computational errors of accelerators well. To address this issue, the work in this thesis is summarized as follows.

First, a remote retraining method for remote AIoT processors with computational errors is proposed. The remote AIoT processor with soft errors is placed in the training loop so that training on the server's application data picks up the live computational errors. The retrained models are therefore resilient to the soft errors. Furthermore, a set of client-server communication APIs is designed to ease implementation of the remote retraining method on traditional training frameworks (e.g., PyTorch) and IoT software stacks.

Second, the remote retraining method is further optimized in terms of total training time and model accuracy. Specifically, a sparse incremental compression method is proposed to reduce the amount of data transmitted during retraining. Additionally, a local triple modular redundancy protection strategy based on a heuristic algorithm is presented to improve the accuracy of the retrained model; it reduces the fault-tolerance cost while keeping the accuracy loss within a guaranteed bound.

Third, the remote retraining method is implemented and evaluated on a set of typical neural network models. The experimental results show that the remote retraining method improves the top-5 accuracy of the models by 1.79% to 15.04% compared with the offline-trained neural network model when the performance loss is 0% to 200%. Compared with the direct remote retraining method, the sparse incremental compression method reduces retraining time by 38% to 91% with negligible accuracy loss.
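
The abstract does not spell out how the sparse incremental compression works, so the following is only a minimal sketch of one plausible reading: the server sends the device only those weights whose change since the last synchronization exceeds a threshold, as an index-to-delta map. The function names `encode_sparse_delta` and `apply_sparse_delta`, the flat-list weight layout, and the `threshold` parameter are illustrative assumptions, not the thesis's actual API.

```python
from typing import Dict, List


def encode_sparse_delta(
    prev: List[float], curr: List[float], threshold: float = 1e-3
) -> Dict[int, float]:
    """Return {index: change} for weights that moved by more than `threshold`.

    Hypothetical helper: `prev` is the copy the device currently holds,
    `curr` is the server's updated weights after a training step.
    """
    if len(prev) != len(curr):
        raise ValueError("weight vectors must have the same length")
    return {
        i: c - p
        for i, (p, c) in enumerate(zip(prev, curr))
        if abs(c - p) > threshold
    }


def apply_sparse_delta(prev: List[float], delta: Dict[int, float]) -> List[float]:
    """Rebuild the receiver-side weights by adding the transmitted changes."""
    out = list(prev)
    for i, d in delta.items():
        out[i] += d
    return out


# The server keeps a mirror of what the device holds, so that changes
# below the threshold accumulate and are sent later instead of being lost.
device_copy = [0.5, -0.25, 1.0, 0.0]
server_weights = [0.5, -0.25, 1.5, 0.0005]

delta = encode_sparse_delta(device_copy, server_weights)
device_copy = apply_sparse_delta(device_copy, delta)
print(len(delta), "of", len(server_weights), "weights transmitted")
```

In this toy case only the third weight crosses the threshold, so one value is transmitted instead of four. Computing deltas against the mirror of the device's weights rather than against the previous server step is a design choice of this sketch to avoid drift between the two copies.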
Keywords/Search Tags:AIoT devices, neural network accelerator, fault tolerance, collaborative training