The era of the Internet of Things has led to explosive growth in the number of mobile end devices, and at the same time, with the continuous development of artificial intelligence technology, more and more intelligent applications need to be deployed on these devices. Deep Neural Network (DNN) models are also widely deployed on mobile devices thanks to the increasing computing power of end devices. However, as the network structure of DNN models becomes more complex and demands more computational resources, it is increasingly difficult for resource-limited end devices to provide high-accuracy, low-latency services. One solution is to deploy DNN models directly on edge servers, which have more resources and higher computational power, and to upload the raw input data to the edge server for inference; however, this not only incurs excessive communication delay but also creates a serious risk of leaking users' private data.

To solve this problem, researchers proposed dividing the DNN model at a certain partition point into two parts, a head model and a tail model, i.e., end-edge co-inference. The head model and the tail model are deployed on the mobile end device and the edge server, respectively, and the intermediate feature tensor output by the head model is uploaded to the edge server as the input of the tail model, which performs the remaining inference. This scheme effectively improves the security of users' private data. However, for some DNN models, transmitting the intermediate feature tensor incurs a larger communication delay than transmitting the raw input data directly, which increases the total inference latency and leaves no suitable partition point within the model. To reduce the impact of communication delay on the total inference latency, researchers propose inserting a compression unit at the partition point, reducing the volume of transmitted data by compressing the intermediate feature tensor (a simplified sketch of this setup is given below). However, since the location of the partition point and the value of the compression ratio both affect the inference accuracy and the inference response time of the DNN model, finding the best partitioning solution requires the developer to traverse and train every compression ratio at every partition point for the current resource environment before deployment. This not only incurs a huge training cost, but also makes it difficult to adapt to dynamic changes in mobile end devices and network bandwidth. To address these problems, this paper proposes a dynamic and fast DNN partitioning strategy under multiple constraints, which significantly reduces the training overhead required for model partitioning and adapts to changes in environmental resources.
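To make the end-edge co-inference setup above concrete, the following is a minimal, illustrative PyTorch-style sketch, not the implementation used in this work; the names CompressionUnit and split_model, the toy CNN, and the 1x1-convolution encoder/decoder are assumptions introduced purely for illustration.

import torch
import torch.nn as nn

class CompressionUnit(nn.Module):
    """Hypothetical bottleneck inserted at the partition point: the encoder shrinks
    the channel dimension of the intermediate feature tensor before transmission,
    and the decoder restores it on the edge server."""
    def __init__(self, channels: int, compression_ratio: float):
        super().__init__()
        compressed = max(1, int(channels / compression_ratio))
        self.encoder = nn.Conv2d(channels, compressed, kernel_size=1)  # runs on the device
        self.decoder = nn.Conv2d(compressed, channels, kernel_size=1)  # runs on the edge server

def split_model(model: nn.Sequential, partition_point: int):
    """Split a sequential DNN into a head part (device) and a tail part (edge server)."""
    layers = list(model.children())
    return nn.Sequential(*layers[:partition_point]), nn.Sequential(*layers[partition_point:])

# Toy example: partition a small CNN after its second layer with a 4x compression ratio.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
head, tail = split_model(model, partition_point=2)
unit = CompressionUnit(channels=32, compression_ratio=4.0)

x = torch.randn(1, 3, 224, 224)            # raw input never leaves the device
feature = unit.encoder(head(x))            # only the compressed tensor is uploaded
output = tail(unit.decoder(feature))       # the edge server finishes the inference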
The specific work is as follows.

1. This paper investigates the relationship between the compression ratio of the compression unit and the inference accuracy of the DNN model, and observes that the relationship curve has an inflection point. Before the inflection point, the model accuracy is barely affected by changes in the compression ratio and remains almost constant; after the inflection point, the inference accuracy drops sharply as the compression ratio increases. Moreover, the relationship curves between compression ratio and inference accuracy show similar trends at different partition points of the same model.

2. Based on this similarity and the inflection-point phenomenon, this paper proposes a machine-learning-based accuracy prediction model. By collecting a small number of training samples at one partition point, the prediction model can quickly estimate the inference accuracy of the DNN model corresponding to each compression ratio at that partition point. Combined with transfer learning, the relationship between compression ratio and inference accuracy learned at a single partition point is quickly extended to the other partition points of the same model using even fewer training samples (sketched below). This method obtains the inference accuracy corresponding to any compression ratio at different partition points of the same DNN model with low training overhead.

3. Building on the above, this paper proposes a dynamic and fast DNN partitioning strategy that quickly locates the best partitioning solution (the insertion position and compression ratio of the compression unit) of the DNN model by considering the computing power of the mobile end device and the edge server as well as the network bandwidth (sketched below). At the same time, the partitioning solution is adjusted in real time according to changes in the network bandwidth and the computational load of the end device in the current environment.

Based on the above work, a system prototype is implemented in this paper to verify the effectiveness of the model partitioning strategy in different platform environments. Compared with previous studies, the proposed model partitioning strategy reduces the training overhead by a factor of 41 or more, with an accuracy loss of only 2% or less.
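The following is a hedged, illustrative sketch of what the accuracy predictor in point 2 could look like. The paper's actual machine-learning predictor is not specified here, so this stand-in uses a simple parametric plateau-then-drop curve (an assumption) and reuses its fitted shape at a second partition point to mimic the transfer step; the function accuracy_curve, the sample values, and the fixed-shape refit are all illustrative assumptions.

# Illustrative only: a toy stand-in for the learned accuracy predictor, assuming
# accuracy follows a plateau-then-sharp-drop curve around an inflection point.
import numpy as np
from scipy.optimize import curve_fit

def accuracy_curve(ratio, plateau, knee, steepness):
    # Accuracy stays near `plateau` below the knee (inflection point) and
    # drops sharply once the compression ratio exceeds it.
    return plateau / (1.0 + np.exp(steepness * (ratio - knee)))

# A small number of (compression ratio, accuracy) samples at one partition point.
ratios = np.array([2.0, 4.0, 8.0, 16.0, 32.0, 64.0])
accs = np.array([0.92, 0.92, 0.92, 0.89, 0.35, 0.02])
params, _ = curve_fit(accuracy_curve, ratios, accs, p0=[0.9, 24.0, 0.2])

# "Transfer" to another partition point of the same model: keep the curve shape
# (plateau, steepness) and refit only the knee from even fewer samples.
plateau, _, steepness = params
few_ratios = np.array([8.0, 40.0])
few_accs = np.array([0.91, 0.15])
(knee2,), _ = curve_fit(lambda r, knee: accuracy_curve(r, plateau, knee, steepness),
                        few_ratios, few_accs, p0=[30.0])

estimate = accuracy_curve(20.0, plateau, knee2, steepness)  # predicted accuracy at ratio 20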
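And a similarly hedged sketch of the selection step in point 3: given an accuracy predictor, profiled per-partition-point compute times, and the measured bandwidth, score every (partition point, compression ratio) pair and keep the lowest-latency one that satisfies the accuracy constraint. The parameter names (device_latency, edge_latency, feature_bytes) and the 2% tolerance default are assumptions for illustration, not interfaces from the paper.

# Illustrative only: pick the partitioning solution (partition point + compression
# ratio) with the lowest estimated end-to-end latency whose predicted accuracy loss
# stays within the tolerance. Re-running this whenever the bandwidth or device load
# changes gives the real-time adjustment described above.
def select_partition(points, ratios, predict_accuracy, device_latency, edge_latency,
                     feature_bytes, bandwidth_bps, baseline_accuracy,
                     max_accuracy_drop=0.02):
    best, best_latency = None, float("inf")
    for p in points:
        for r in ratios:
            if baseline_accuracy - predict_accuracy(p, r) > max_accuracy_drop:
                continue  # predicted accuracy loss exceeds the tolerance
            transmission = feature_bytes(p, r) * 8.0 / bandwidth_bps  # upload delay
            latency = device_latency(p) + transmission + edge_latency(p)
            if latency < best_latency:
                best, best_latency = (p, r), latency
    return best, best_latency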