Research On Cost-Efficient Cloud Resource Provisioning For Predictable Deep Neural Network Training

Posted on:2021-04-22

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Zheng

Full Text:PDF

GTID:2428330620468140

Subject:Computer Science and Technology

Abstract/Summary:

To improve the inference accuracy of deep learning (DL) models, training datasets keep growing in size and deep neural network (DNN) models keep increasing in complexity. As a result, the computing and storage capacity of the traditional single-server training method can no longer meet the training requirements of large DNN models. To guarantee training performance and save training budget, it is increasingly compelling to train DNN models in a distributed manner in public clouds. However, our motivating experiment of training representative DNN models in Amazon EC2 shows that distributed DNN (DDNN) training performance fluctuates sharply.

We further analyze the results of our motivating experiment and identify three intricate factors that cause the DDNN training performance variation. First, the parameter server in the training cluster can easily become a resource bottleneck, due to the mismatch between computing and network resources in current mainstream public cloud instances, as well as the frequent exchange of model parameters over the network. Second, compared with locally controllable cluster resources, the underlying hardware heterogeneity of public cloud instances can have a severe impact on DDNN training performance. Finally, due to the optimization mechanisms of existing distributed deep learning frameworks, there is an imbalance between computing time and communication time in DDNN training, which eventually causes severe under-utilization of computing resources.

To tackle the issues above, this thesis proposes Cynthia, a cost-efficient cloud resource provisioning framework that provides predictable DDNN training performance. Specifically, Cynthia first predicts the DDNN training time by establishing a lightweight analytical performance model and a loss function of DDNN training. To achieve accurate performance prediction, our model incorporates the performance variations of DDNN training caused by resource bottlenecks and hardware heterogeneity. Second, based on our performance prediction model, Cynthia devises a simple yet effective cloud resource provisioning strategy to jointly guarantee DDNN training performance and minimize monetary cost. Finally, Cynthia is implemented on top of Kubernetes by launching a 56-docker cluster in Amazon EC2. Extensive experiments based on our Cynthia prototype demonstrate that Cynthia can provide predictable training performance while reducing the monetary cost of DDNN workloads by up to 50.6%, with acceptable runtime overhead.
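
The abstract does not give Cynthia's actual formulas, so the following is only a minimal sketch of the general idea: predict training time from an analytical model, then pick the cheapest cluster that meets a deadline. Everything in it is an assumption made for illustration: the names `Job`, `InstanceType`, `training_time`, and `cheapest_plan`, a fixed global batch split evenly across workers, a single instance type for workers and parameter servers, no overlap between computation and communication, and a known number of iterations to reach the target loss.

```python
from dataclasses import dataclass
from itertools import product
from typing import NamedTuple, Optional


@dataclass(frozen=True)
class Job:
    iterations: int             # assumed iterations needed to reach the target loss
    flops_per_iteration: float  # compute for one global batch, in FLOP
    model_bytes: float          # size of the model parameters, in bytes


@dataclass(frozen=True)
class InstanceType:
    flops: float      # sustained compute per instance, FLOP/s
    bandwidth: float  # network bandwidth per instance, bytes/s
    price: float      # USD per instance-hour


class Plan(NamedTuple):
    workers: int
    servers: int
    seconds: float
    cost: float


def training_time(job: Job, inst: InstanceType, workers: int, servers: int) -> float:
    """Toy estimate of total training time in seconds (hypothetical model)."""
    # Computation shrinks as the global batch is split across more workers.
    compute = job.flops_per_iteration / (workers * inst.flops)
    # Each worker pushes gradients and pulls parameters every iteration, so
    # parameter-server traffic grows with the number of workers.
    comm = 2 * job.model_bytes * workers / (servers * inst.bandwidth)
    return job.iterations * (compute + comm)


def cheapest_plan(job: Job, inst: InstanceType, deadline_s: float,
                  max_nodes: int = 16) -> Optional[Plan]:
    """Exhaustively search small clusters for the cheapest one meeting the deadline."""
    best: Optional[Plan] = None
    for workers, servers in product(range(1, max_nodes + 1), repeat=2):
        seconds = training_time(job, inst, workers, servers)
        if seconds > deadline_s:
            continue  # this configuration misses the deadline
        cost = (workers + servers) * inst.price * seconds / 3600.0
        if best is None or cost < best.cost:
            best = Plan(workers, servers, seconds, cost)
    return best  # None when no configuration meets the deadline


if __name__ == "__main__":
    job = Job(iterations=1000, flops_per_iteration=1e12, model_bytes=1e8)
    inst = InstanceType(flops=1e12, bandwidth=1e9, price=1.0)
    print(cheapest_plan(job, inst, deadline_s=800))
```

Even this toy model shows the parameter-server bottleneck described above: with a single server, adding workers eventually increases the estimated training time because communication grows faster than computation shrinks. Hardware heterogeneity, which the thesis also models, is not captured here.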

Keywords/Search Tags:

Cloud Resource Provisioning, Deep Neural Network Training, Predictable Performance, Resource Bottleneck, Resource Heterogeneity

Related items

1	Research On Interference-aware GPU Resource Provisioning For Predictable DNN Inference
2	Research On Resource Provisioning Mechanisms For Cloud Services Based On Service Selection
3	Scheduling Methods For Cloud Workflows With Various Resource Provisioning Manners
4	Research On Cost-Aware Virtual Resource Provisioning Mechanisms For Cloud Services
5	Research On Performance Guarantee Of Distributed DNN Training With Serverless Architectures
6	The Research Of A Multi-Objectives Dynamic Hybrid Cloud Resource Provisioning Mechanism
7	Research On Energy Efficiency Oriented Virtualized Resource Provisioning Method In Cloud
8	Provisioning Heterogeneous Spot Instances For Predictable Distributed DNN Training In The Cloud
9	Resource Provisioning Methods For Cloud Workflow Applications
10	Research On Wireless Virtual Network Resource Allocation Based On Deep Reinforcement Learning