
Collaborative Allocation Of Storage And Compute Resources For Deep Learning Clusters In Cloud

Posted on: 2024-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: M X Li
Full Text: PDF
GTID: 2568306929490274
Subject: Computer software and theory
Abstract/Summary:
The separation of compute and storage in modern cloud services eases the deployment of general applications and is therefore widely adopted in industry. However, with the development of accelerators such as GPUs and TPUs, I/O bottlenecks occur frequently in deep learning scenarios because data can be pulled from the storage service more slowly than model training consumes it, leaving the expensive computing units severely under-utilized. For deep learning workloads in this scenario, we must trade off enlarging the local cache against upgrading the bandwidth of the storage service to alleviate the I/O bottleneck. And since compute resources are scalable and are themselves affected by the I/O bottleneck, their collaborative allocation also matters. Designing the best storage and compute resource allocation strategy is challenging: deep learning training models have heterogeneous preferences for cache and bandwidth. Lightweight models with smaller datasets and faster training speeds prefer cache, while models with larger datasets and slower training speeds prefer bandwidth. Meanwhile, a dataset can be shared among multiple jobs, and dynamic GPU scaling has heterogeneous effects on the throughput of different deep learning training jobs.

In this work, based on a sub-linear bandwidth pricing model for the storage service and a model of the actual throughput of different deep learning jobs, we propose CBA, a cache and bandwidth allocation scheme for heterogeneous jobs. It exploits job characteristics derived from training throughput, dataset size, and scalability. We prove theoretically that CBA's coordinated allocation of bandwidth and cache is optimal when cache latency is ignored, minimizing the storage-level cost of eliminating the I/O bottleneck; for jobs with a fixed GPU allocation, CBA thus minimizes the training cost. Building on CBA, we consider the heterogeneous sensitivity of deep learning jobs to their completion time, study the collaborative allocation of storage and compute resources, and further derive AutoCBA, a compute resource allocation scheme that accelerates training under a given budget. For clusters that can automatically scale the GPU allocations of jobs, AutoCBA supports diverse job utility functions and improves social welfare within a limited budget.

Using a dataset from the mainstream cloud service provider we cooperate with, collected in real production environments, we conduct large-scale simulations. Experimental results show that CBA reduces the storage-level cost by up to 20.5% compared with existing baselines. Compared with state-of-the-art deep learning allocation schemes, AutoCBA improves total social welfare by up to 2.27 times even when the baselines are enhanced to promote fairness. Within a given budget, AutoCBA reduces the average job completion time by up to 8.9%, and under different budget settings AutoCBA always outperforms the baselines.
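The cache-versus-bandwidth trade-off described above can be illustrated with a minimal sketch. All function names, prices, and the cost model here are illustrative assumptions, not the actual CBA formulation: a job that must ingest data at a fixed rate serves a cached fraction locally and pulls the remainder over storage bandwidth, which is priced sub-linearly.

```python
# Toy sketch of the cache-vs-bandwidth trade-off (hypothetical cost model,
# not the thesis's actual CBA formulation).

def storage_cost(cache_frac, dataset_gb, demand_gbps,
                 cache_price=0.5, bw_price=2.0, bw_exp=0.8):
    """Storage-level cost of one job: the cached fraction is paid per GB
    held, and the uncached remainder must be pulled at matching bandwidth,
    priced sub-linearly (bandwidth ** 0.8 models volume discounts)."""
    bw_needed = (1.0 - cache_frac) * demand_gbps
    return cache_price * cache_frac * dataset_gb + bw_price * bw_needed ** bw_exp

def best_allocation(dataset_gb, demand_gbps, steps=1000):
    """Grid-search the cache fraction that minimizes storage cost while
    still sustaining the job's data demand (no I/O bottleneck)."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda c: storage_cost(c, dataset_gb, demand_gbps))

# A lightweight job with a small dataset is pushed toward full caching,
# while a large dataset tilts the optimum toward buying bandwidth instead,
# mirroring the heterogeneous preferences described in the abstract.
print(best_allocation(dataset_gb=5, demand_gbps=4))
print(best_allocation(dataset_gb=500, demand_gbps=4))
```

Under this toy pricing, the small-dataset job caches everything (cache fraction 1.0) while the large-dataset job relies entirely on bandwidth (0.0), which is the qualitative behavior the heterogeneous-preference argument relies on.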
Keywords/Search Tags:Deep learning training, Cloud computing, Separate compute and storage, IO bottleneck, Cache, Bandwidth, Resource allocation