In recent years, machine learning has played a critical role in many fields. However, in pursuit of better training performance, practitioners train increasingly complex models on ever-larger datasets, which leads to a growing demand for training resources. A single machine is clearly unable to meet these requirements, so distributed training has gradually become an essential technique for model training. Distributed training uses strategies such as model parallelism and data parallelism to spread training tasks across multi-machine, multi-GPU platforms, aggregating dispersed resources to meet the needs of large-scale model training. However, due to hardware failures, design weaknesses, and other causes, large-scale distributed training suffers from low resource utilization and a high risk of interruption, which increases the time cost of model training. Many high-performance computing frameworks, such as PyTorch and Horovod, have been proposed, but none of them solves this problem well. It is therefore of great theoretical and practical significance to design and develop a high-performance, scalable, and fault-tolerant elastic distributed training framework.

The goal of this paper is to implement a high-performance elastic training framework based on the distributed computing framework Ray, with the following main contributions.

First, distributed training tasks are encapsulated and run with the Ray framework, and the performance scalability of multi-node training is optimized. By adopting the InfiniBand high-performance networking standard and hardware to increase inter-node communication bandwidth, and the NVIDIA Collective Communications Library (NCCL) to optimize intra-node GPU communication, training tasks achieve more than 90% scaling efficiency at a resource scale of 256 GPUs. Furthermore, building on the advantages of the Ray framework's Actor programming model, Kubernetes (K8S)-based cluster scheduling, and comprehensive interface functions, this paper researches and designs an elastic training mechanism that provides elastic scaling and training fault tolerance. When the training process fails or node resources change, the mechanism catches and handles the exception in time, dynamically updates the cluster resource information, and restarts the training task. Training progress is resumed at restart by loading the previously saved intermediate-state parameters, achieving a smooth transition.

Second, this paper measures in detail the performance of the distributed shared-memory ObjectStore and, on that basis, designs and implements a distributed data cache with an efficient Elastic_DataLoader. The distributed cache spreads the dataset across the nodes, reducing the memory burden on each node and ensuring that the cached data is not easily lost. Elastic_DataLoader achieves more than 6x and 2x performance improvements over the PyTorch DataLoader in the single-node and multi-node cases, respectively.

Finally, the Elastic_DataLoader and the elastic training mechanism are integrated into a complete elastic training framework, whose performance and accuracy are evaluated on a real training task. The results show that the elastic training framework implemented in this paper provides better performance and robustness for training tasks while preserving training accuracy.
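The catch-restart-resume cycle described above can be illustrated with a minimal, framework-free sketch. All names here (`Checkpoint`, `train_step`, `elastic_train`) are illustrative stand-ins, not the thesis's actual API; in the real framework the failure would come from a worker Actor or a node change, and the restart would also refresh the cluster resource view.

```python
# Minimal sketch of checkpoint-based elastic restart: save state after each
# successful step, and on failure resume from the last saved checkpoint.
# All identifiers are hypothetical, for illustration only.
import random


class Checkpoint:
    """Intermediate training state persisted after each successful step."""
    def __init__(self, step=0, state=None):
        self.step = step
        self.state = state if state is not None else {"loss": float("inf")}


def train_step(state, step):
    """One training step; randomly raises to simulate a worker failure."""
    if random.random() < 0.2:
        raise RuntimeError(f"simulated worker failure at step {step}")
    state["loss"] = 1.0 / (step + 1)  # pretend the loss improves
    return state


def elastic_train(total_steps, seed=0):
    random.seed(seed)
    ckpt = Checkpoint()
    while ckpt.step < total_steps:
        try:
            new_state = train_step(dict(ckpt.state), ckpt.step)
            # Persist the new state only after the step succeeds.
            ckpt = Checkpoint(ckpt.step + 1, new_state)
        except RuntimeError:
            # Failure caught: in the real mechanism, update the cluster
            # resource info here, then restart from the last checkpoint.
            continue
    return ckpt


ckpt = elastic_train(10)
print(ckpt.step, ckpt.state["loss"])
```

Despite the injected failures, the loop always completes all steps, because progress is measured by the last durable checkpoint rather than by the current in-flight step.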