
Optimization Techniques For Distributed Machine Learning Systems

Posted on: 2024-03-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z H Chen
Full Text: PDF
GTID: 1528307070960639
Subject: Software Engineering
Abstract/Summary:
With the development of the Internet, industries are moving toward informatization, digitalization, and intelligentization, which has popularized machine learning. Upper-layer applications drive the development of the underlying systems, and as a result distributed machine learning systems have emerged, providing data engineers with convenient interfaces and efficient computing platforms. Currently, these systems fall into two classes. The first class specializes in big data applications that require batch processing of large-scale datasets; such systems partition the data across a cluster so that it can be read and processed in parallel. The second class serves large-model applications that require training large deep neural networks; these applications handle only small batches of data at a time, but the trained models are complex in structure and large in parameter size. To manage the models efficiently, the systems partition them across the cluster, thereby parallelizing the training tasks.

However, while distributed machine learning systems use clusters to accelerate computation, they also suffer from communication overhead. For big data applications, cluster nodes must exchange input data or intermediate results, incurring heavy data transmission. For large-model applications, partitioning the model also partitions the computation: systems must access model parameters remotely or transfer intermediate data, which frequently interrupts computation. Moreover, machine learning is evolving rapidly and demands ever higher performance from the underlying systems. To this end, this thesis focuses on application characteristics and performance bottlenecks, and explores optimization techniques for distributed machine learning systems from four aspects. The main work and contributions are summarized as follows.

(1) Hybrid evaluation for iterative computation in big data analysis, reducing unnecessary overhead. Big data analysis typically involves time-consuming iterative computation, which hinders engineering efficiency and the timeliness of algorithm results. However, iteratively updated elements converge at different rates, so systems need not repeat the computation on elements that have already converged. Hence, this work first adopts incremental computation to avoid unnecessary computation overhead, and proposes matrix reorganization to reduce communication overhead. Second, because the extra bookkeeping of incremental computation can itself degrade performance, especially when no element has converged, this work proposes hybrid evaluation, which uses a cost model at runtime to decide whether to apply incremental computation, achieving an 8.0x overall improvement (see the first sketch after contribution (2) below).

(2) Adaptive elimination of redundant computation in big data analysis, avoiding waste of cluster resources. Data engineers rely on linear algebra to express machine learning algorithms, and because of the complexity of algorithm logic, the resulting expressions may contain redundant subexpressions that waste substantial cluster resources during big data analysis. To eliminate these redundancies, this work first proposes a block-wise search method that exploits the characteristics of linear algebra to shrink the search space. Second, to generate an efficient execution plan with redundancy elimination, this work proposes adaptive redundancy elimination based on dynamic programming, which tames the combinatorial explosion of elimination options and saves the overhead of cost evaluation. Overall, adaptive elimination achieves a 14.4x improvement, as sketched below.
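To make contribution (1) concrete, the following is a minimal sketch of hybrid evaluation: a cost model, here assumed to take a simple linear form, decides each iteration whether incremental computation over the still-active elements is cheaper than full recomputation. The function names, the cost model's form, and all constants are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch of hybrid evaluation (contribution 1): choose per
# iteration between full and incremental computation via a cost model.
# The linear cost model and all constants are illustrative assumptions.

def choose_mode(n_total, n_active, unit_cost, inc_overhead):
    full_cost = n_total * unit_cost                  # recompute every element
    inc_cost = n_active * unit_cost + inc_overhead   # active elements + bookkeeping
    return "incremental" if inc_cost < full_cost else "full"

# Early on, nearly all elements are still active, so full evaluation wins;
# as elements converge, incremental evaluation takes over.
print(choose_mode(n_total=10**6, n_active=10**6, unit_cost=1.0, inc_overhead=5e4))  # full
print(choose_mode(n_total=10**6, n_active=10**4, unit_cost=1.0, inc_overhead=5e4))  # incremental
```

Contribution (2) can likewise be illustrated. The sketch below performs a simplified common-subexpression elimination over a linear algebra expression tree; the thesis's block-wise search and dynamic-programming plan selection are more elaborate, so treat this only as a sketch of the core reuse idea, with hypothetical names throughout.

```python
# Simplified redundant-subexpression elimination over a linear algebra
# expression tree (contribution 2). Structure and names are hypothetical.

class Expr:
    def __init__(self, op, *children, name=None):
        self.op, self.children, self.name = op, children, name

    def key(self):
        # Structural hash: identical subtrees yield identical keys.
        return (self.op, self.name, tuple(c.key() for c in self.children))

def eliminate_redundancy(root):
    """Produce a post-order plan that evaluates each distinct
    subexpression exactly once, reusing repeated occurrences."""
    seen, plan = {}, []

    def visit(node):
        if node.key() in seen:
            return seen[node.key()]          # reuse the earlier occurrence
        node.children = tuple(visit(c) for c in node.children)
        seen[node.key()] = node
        plan.append(node)
        return node

    visit(root)
    return plan

# (X^T X)^{-1} (X^T X): the subexpression X^T X occurs twice in the
# expression but appears (and is computed) only once in the plan.
X = Expr("var", name="X")
xtx = lambda: Expr("matmul", Expr("transpose", X), X)
plan = eliminate_redundancy(Expr("matmul", Expr("inverse", xtx()), xtx()))
print([node.op for node in plan])  # ['var', 'transpose', 'matmul', 'inverse', 'matmul']
```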
(3) Elastic averaging-based pipeline for training compute-intensive models, addressing idle GPUs. Training compute-intensive models involves a large memory footprint and high arithmetic intensity, so systems adopt pipeline parallelism, which splits the model and processes data in a pipeline. However, because forward and backward propagation proceed in opposite directions, the computational resources sit idle periodically. To address this, this work proposes an elastic averaging-based framework that launches parallel pipelines to increase training parallelism, thereby reducing the fraction of idle time. Moreover, this work proposes dynamic forward-propagation scheduling, which consumes intermediate data early to reduce the memory footprint of the parallel pipelines. In addition, this work proposes a profiling-based tuning method that sets the parallelism degrees introduced by the framework so as to guarantee system performance. Finally, the elastic averaging-based pipeline achieves a 1.7x improvement.

(4) Workload-aware scheduling for training data-intensive models, overlapping communication latency with computation. Training data-intensive models arises in recommendation scenarios, where the models rely on large-scale embedding tables to map the various features of the input data. Existing systems partition the embedding tables across multiple servers, so the computing nodes must issue remote data requests to access them, which incurs communication latency and blocks training progress. For this reason, this work proposes a workload-aware scheduling strategy that exploits the computational structure of feature processing and the critical-path phenomenon to overlap communication with computation, achieving a 1.4x improvement (a minimal sketch follows the summary below).

In summary, this thesis addresses the pain points of big data and large-model applications, and improves system performance from the perspectives of framework design, execution plan optimization, and task scheduling, with implications for future research.
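As an illustration of contribution (4), the sketch below overlaps remote embedding lookups with dense computation by prefetching the next batch's embeddings while the current batch trains. The prefetch schedule, function names, and single-worker executor are assumptions made for illustration; the thesis's workload-aware strategy additionally exploits feature-processing structure and the critical path.

```python
# Hypothetical sketch of communication/computation overlap for
# embedding-table training (contribution 4). All names are illustrative.

from concurrent.futures import ThreadPoolExecutor

def fetch_embeddings(batch):
    # Stand-in for remote lookups against partitioned embedding servers.
    return [f"emb({x})" for x in batch]

def train_step(embeddings):
    # Stand-in for the dense forward/backward computation.
    return sum(len(e) for e in embeddings)

def train(batches):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_embeddings, batches[0])  # warm-up fetch
        for i in range(len(batches)):
            embeddings = future.result()        # wait for the current lookup
            if i + 1 < len(batches):            # issue the next lookup now,
                future = pool.submit(fetch_embeddings, batches[i + 1])
            train_step(embeddings)              # ...so it overlaps compute

train([[1, 2], [3, 4], [5, 6]])
```

In this schedule the lookup for batch i+1 proceeds while batch i computes, hiding the communication latency whenever compute time dominates the lookup time.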
Keywords/Search Tags:Machine Learning System, Distributed Computing, Data Analysis, Model Training, Query Optimization, Task Scheduling