
Research On Cost-Efficient Cloud Resource Provisioning For Predictable Deep Neural Network Training

Posted on: 2021-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Zheng
GTID: 2428330620468140
Subject: Computer Science and Technology
Abstract/Summary:
To improve the inference accuracy of deep learning (DL) models, training datasets are growing larger and deep neural network (DNN) models are growing more complex. As a result, the traditional single-server training method, with its limited computing and storage capacity, cannot meet the training requirements of large DNN models. To guarantee training performance and save training budget, it becomes increasingly compelling to train DNN models in a distributed manner in public clouds.

However, our motivation experiment of training representative DNN models in Amazon EC2 shows that distributed DNN (DDNN) training performance fluctuates sharply. We further analyze the results of this experiment and identify three intricate factors that cause the variation in DDNN training performance. First, the parameter server in the training cluster can easily become a resource bottleneck, due to the mismatch between computing and network resources in current mainstream public cloud instances, as well as the frequent exchange of model parameters over the network. Second, compared with locally controlled cluster resources, the underlying hardware heterogeneity of public cloud instances can severely affect DDNN training performance. Finally, due to the optimization mechanisms of existing distributed deep learning frameworks, computing time and communication time in DDNN training are imbalanced, which eventually causes severe under-utilization of computing resources.

To tackle the issues above, this thesis proposes Cynthia, a cost-efficient cloud resource provisioning framework that provides predictable DDNN training performance. Specifically, Cynthia first predicts the DDNN training time by establishing a lightweight analytical performance model and a loss function of DDNN training. To achieve accurate performance prediction, the model incorporates the performance variations of DDNN training caused by resource bottlenecks and hardware heterogeneity. Second, based on this performance prediction model, Cynthia devises a simple yet effective cloud resource provisioning strategy to jointly guarantee DDNN training performance and minimize monetary cost. Finally, Cynthia is implemented on top of Kubernetes by launching a cluster of 56 Docker containers in Amazon EC2. Extensive experiments based on the Cynthia prototype demonstrate that Cynthia can provide predictable training performance while reducing the monetary cost of DDNN workloads by up to 50.6%, with acceptable runtime overhead.
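
As a rough illustration of the idea described above (predict training time with an analytical model, then pick the cheapest configuration that meets a deadline), here is a minimal Python sketch. It is not Cynthia's actual model or algorithm, which the abstract does not spell out. The time model (per-iteration computation plus non-overlapped parameter exchange with the parameter servers), the exhaustive search, and every name and number in it (InstanceType, Plan, training_hours, cheapest_plan, the price and throughput fields) are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    # All fields are hypothetical; they do not describe real EC2 instances.
    name: str
    price_per_hour: float        # cost of one instance for one hour
    samples_per_sec: float       # training throughput of one worker
    bandwidth_mb_per_sec: float  # network bandwidth of one parameter server


@dataclass(frozen=True)
class Plan:
    instance: str
    workers: int
    servers: int
    hours: float
    cost: float


def training_hours(inst: InstanceType, workers: int, servers: int,
                   total_samples: float, batch_size: int,
                   model_mb: float) -> float:
    """Toy estimate of training time (an assumption, not the thesis model)."""
    # Each iteration consumes one batch per worker.
    iterations = total_samples / (workers * batch_size)
    compute_s = batch_size / inst.samples_per_sec
    # Every worker pushes gradients and pulls parameters (2 x model size);
    # the traffic is spread across the parameter servers, so too few
    # servers turn the network into the bottleneck.
    comm_s = 2.0 * model_mb * workers / (servers * inst.bandwidth_mb_per_sec)
    # Computation and communication are assumed not to overlap.
    return iterations * (compute_s + comm_s) / 3600.0


def cheapest_plan(types: Sequence[InstanceType], deadline_hours: float,
                  total_samples: float, batch_size: int, model_mb: float,
                  max_nodes: int = 16) -> Optional[Plan]:
    """Exhaustively search small clusters; return the cheapest feasible one."""
    best: Optional[Plan] = None
    for inst in types:
        for workers in range(1, max_nodes + 1):
            # Keep workers + servers within the node budget.
            for servers in range(1, max_nodes - workers + 1):
                hours = training_hours(inst, workers, servers,
                                       total_samples, batch_size, model_mb)
                if hours > deadline_hours:
                    continue  # predicted to miss the deadline
                cost = hours * (workers + servers) * inst.price_per_hour
                if best is None or cost < best.cost:
                    best = Plan(inst.name, workers, servers, hours, cost)
    return best
```

Calling cheapest_plan with a list of candidate instance types and a deadline returns either a Plan or None when no configuration within max_nodes is predicted to finish in time. A realistic model would also need the hardware heterogeneity and loss-convergence terms that the abstract mentions, which this sketch leaves out.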
Keywords/Search Tags: Cloud Resource Provisioning, Deep Neural Network Training, Predictable Performance, Resource Bottleneck, Resource Heterogeneity