Optimization Scheduling Technologies For Multi-dimensional Resources And Distributed Pipeline Parallel Training In Cloud Computing

Posted on: 2024-12-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Y Zhou
GTID: 1528307373471094
Subject: Software engineering
Abstract/Summary:
In the era of rapid Internet development, Cloud computing, as a key field of the digital economy, has received broad attention. Cloud computing is an important paradigm in current distributed computing systems. Resource scheduling and management in large-scale distributed systems, which relies on fundamental mathematical branches such as discrete mathematics, operations research, and stochastic processes, has always been one of the key and difficult problems in computer science: it must simultaneously consider multi-dimensional resources and multiple optimization objectives, and its optimization modeling and solving theory still needs to be developed. In addition, the scale of artificial intelligence models, mainly represented by deep learning, is growing ever larger, and their parallel training relies on the development of Cloud system architectures and resource scheduling technologies. Most resource scheduling problems in distributed systems are NP-hard, and research on algorithm design and optimization theory lags behind industrial applications. Various existing algorithms can obtain feasible scheduling schemes for specific scenarios, but there is significant room for improvement in their computational complexity and optimality. To explore methods and strategies that enhance the resource management capabilities of distributed computing systems such as Cloud computing, this dissertation selects five sets of scheduling scenarios in Cloud environments (all of whose problems are NP-hard) and discusses the design of new optimization algorithm architectures and algorithm theory for them. The main research work and conclusions are as follows:

(1) To address the scheduling problem of independent task sets in Cloud nodes considering single-dimensional resources, the dissertation proposes a multi-route local search algorithm (MSRA) series using heuristic algorithms as search routes (HLSA). The dissertation models the problem of minimizing makespan, proposes several MSRA algorithms, and proves that the upper bound on the approximation ratio of the proposed algorithms is 5/4 in the classical problem of minimizing the makespan on parallel machines, an improvement of 1/12 over the 4/3 of the longest processing time (LPT) algorithm. The experimental results show that, for homogeneous systems, MSRA reduces the average makespan by ≥8.56% in large-scale scenarios and obtains the theoretical optimal solution with probability 73.56% in small-scale scenarios, which is 1.49 times that of the best baseline algorithm. For heterogeneous systems, MSRA reduces the average makespan by ≥9.47%.

(2) To solve the scheduling problem of independent virtual machine sets in Cloud nodes considering multi-dimensional resources, the dissertation proposes a growable genetic algorithm (GGA) series with an additional growth route. The dissertation formulates the problem of minimizing the maximum utilization of resources in each dimension and minimizing the system's energy consumption in heterogeneous nodes, and proposes the instantiation algorithms GHW-NSGA II and GHW-MOEA/D, which use HLSA as the growth route. The GGA series reconstructs the genetic algorithm architecture, allowing various algorithms to serve as its growth route, thus improving the convergence speed of the genetic solution process and the optimality of the converged solutions. Experiments driven by simulated and public data sets validate the advantages of GGA: in large-scale experimental scenarios, the convergence speed of GGA can reach 10 times that of the baseline evolutionary algorithms (NSGA II and MOEA/D).

(3) For the scheduling problem of parallel training workflows of deep models using pipeline parallelism based on equal-Microbatch data partitioning in GPU Cloud server nodes, the dissertation derives analytic formulas for a theoretical cost model (which not only considers GPU computing time and network communication time simultaneously, but also takes into account the nonlinear relationship between them and the data amount), and proposes the improved multi-dimensional dichotomy (IMD) and the IMD-based cross-search algorithm (CSIMD). The dissertation proves that the theoretical optimality error of IMD can approach 0, and that IMD's computational complexity grows linearly, much lower than that of the best baseline algorithms, including dynamic programming and the recursive algorithm. Parallel training experiments in a realistic distributed environment show that the average training speeds obtained by CSIMD are 2.0× and 2.5× those of the baseline strategies GPipe-R and GPipe-E, respectively, when training CNN-related networks, and 1.5× and 1.6×, respectively, when training transformer-related networks.

(4) The dissertation designs a novel parallel training architecture for deep learning, i.e., unequal Microbatches-based pipeline parallelism (UMPIPE). UMPIPE allows different processes of the neural network to select different Microbatch sizes, adding better training schemes to the set of feasible solutions. To solve for the optimal scheme of UMPIPE, the dissertation proposes a dual-chromosome genetic algorithm series (DGAP). To tackle the difficulty of calculating the training time of a UMPIPE training scheme, the dissertation further proposes a matrix operation-based two-level acceleration strategy that simultaneously calculates the training end times for multiple individuals and multiple Microbatches of DGAP. Theoretical analysis proves the optimality of the UMPIPE architecture, and proves that the convergence of the dual-chromosome strategy is far superior to that of the single-chromosome strategy for solving UMPIPE. Experiments training GPT1 and VGG16 in a realistic environment show that the speeds of UMPIPE's training schemes increase by 13.89% and 14.36%, respectively, compared with the optimal training scheme under the baseline architecture GPipe.

(5) The dissertation formulates the system model of hierarchical Cloud computing with multiple subsystems (HCCMS) to handle diverse task requests and diverse optimization objectives. For the joint optimization problem with multiple subproblems, the dissertation proposes a novel perspective that regards the scheduling algorithms themselves as schedulable resources, and designs a scheduling framework to select among the scheduling algorithms. To instantiate the algorithm selector series, the dissertation proposes the deep learning-based algorithm selector (DLS) and the deep reinforcement learning-based algorithm selector (DRLS). Compared with the best results among the baseline strategies, DLS reduces the weighted cost of the system by 18.8% in scenarios with a stable parameter range, and DRLS reduces the weighted cost of the system by 11.5% in dynamic scenarios with varying parameter ranges.

This dissertation proposes multiple families of optimization scheduling algorithms for changes in scenario elements at progressively deeper levels, involving various types of algorithms such as heuristic, local search, meta-heuristic, and machine learning. It expands the adaptability of Cloud systems and algorithm systems in both application breadth and theoretical depth: the progression from single-dimensional to multi-dimensional resources improves adaptability to changes in resource dimensions; the progression from independent task sets to associated workflow sets improves adaptability to changes in task correlations; the progression from a single objective to multiple objectives improves adaptability to changes in the number of optimization objectives; and the progression from a single center with a single layer to multiple centers with multiple layers and multiple subsystems improves the joint utility across system hierarchies and scheduling scenarios.
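
As a point of reference for item (1), the longest processing time baseline on identical parallel machines can be sketched as follows. This is a minimal illustration of the classical rule only, not the dissertation's MSRA; the function name `lpt_schedule` and the sample job list are assumptions introduced here for illustration.

```python
import heapq


def lpt_schedule(processing_times, num_machines):
    """Greedy longest-processing-time rule on identical machines.

    Hypothetical helper for illustration. Returns (makespan, assignment),
    where assignment[i] lists the job indices placed on machine i.
    """
    if num_machines <= 0:
        raise ValueError("num_machines must be positive")

    # Min-heap of (current load, machine index): least-loaded machine on top.
    heap = [(0, m) for m in range(num_machines)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_machines)]

    # Visit jobs from longest to shortest processing time.
    order = sorted(
        range(len(processing_times)),
        key=lambda j: processing_times[j],
        reverse=True,
    )
    for job in order:
        load, machine = heapq.heappop(heap)
        assignment[machine].append(job)
        heapq.heappush(heap, (load + processing_times[job], machine))

    makespan = max(load for load, _ in heap)
    return makespan, assignment


# Assumed sample instance: five jobs on two machines.
makespan, assignment = lpt_schedule([3, 3, 2, 2, 2], 2)
print(makespan, assignment)  # makespan is 7; the optimum {3,3} | {2,2,2} is 6
```

On this sample instance the greedy rule yields a makespan of 7 against an optimum of 6, a ratio of 7/6, which stays within the 4/3 bound cited for this baseline.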
Keywords: Cloud Environment, Scheduling of Multi-Dimensional Resources, Pipeline Parallelism, Growable Genetic Algorithm, Algorithm Selector