Training deep neural networks (DNNs) places heavy demands on device compute and memory capacity, and single-device training can no longer satisfy them. Parallel training splits the model and the data according to a split strategy and distributes them across multiple devices, meeting the resource demand and accelerating the training process. However, the complexity of DNNs and the large variety of parallel strategies make the optimal strategy hard to find, and the problem becomes worse in heterogeneous scenarios: the imbalance in compute capacity among heterogeneous devices demands extra fine-tuning effort, so model developers spend their time tuning hardware split groups instead of developing models. Automatic parallel training tools have been proposed to separate model development from the underlying parallel training details; the user only specifies a few critical split points, and the tool automatically searches for the optimal parallel strategy. However, existing automatic tools target specific models, which makes migrating them to new models a non-trivial task. In addition, some of them lack a general description of the split strategy, so it is difficult for framework developers to port one tool's algorithm to another.

This thesis proposes an automatic parallel training library for DNNs whose foci are usability and portability. On the usability side, the library acts as an optimizing graph pass: users annotate the critical tensors through a unified interface, and the library automatically transforms the user-provided single-device graph into a multi-device graph for parallel training. On the portability side, and most importantly, this work proposes a unified interface description: a single interface can express all common model-parallel training strategies, which makes migrating one split algorithm to another an easy task. At the backend, the library converts the framework-dependent computation graph into a framework-independent intermediate representation (IR) and performs parallel training on this IR, so framework developers only need to implement the conversion between IRs to port their algorithms to the library.

To give framework developers a unified description of split algorithms, this thesis defines the split strategy as a unified property and defines a property propagation process over the graph nodes based on the dimension mapping of the tensors. Property propagation lets the user-annotated split strategy spread through the computation graph under a single rule and configure the remaining nodes automatically, so framework developers can reuse one propagation logic to accelerate parallel training. Finally, to support heterogeneous devices, the thesis proposes a simple cost model to further accelerate parallel training. In practice, the library shows good acceleration performance and scaling efficiency: it achieves 94% scaling efficiency and a 3.77x speedup for the ResNet-50 model on 4 GPUs.
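To make the annotation-and-propagation idea more concrete, the sketch below shows one possible shape such an interface could take: a tensor-level split property annotated on a critical input, then propagated through a toy graph according to each node's dimension mapping. This is a minimal illustration under assumed names (SplitProperty, Node, propagate, dim_map); it is not the library's actual API.

```python
# Hypothetical sketch of annotating a split property on a critical tensor and
# propagating it through a computation graph via dimension mappings.
# All names here are illustrative assumptions, not the library's real interface.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SplitProperty:
    """Which tensor dimension is sharded, and across how many devices."""
    dim: int          # tensor dimension to split (e.g. 0 = batch dimension)
    num_shards: int   # number of devices the dimension is split across


@dataclass
class Node:
    """A node in a toy framework-independent IR."""
    name: str
    op: str
    inputs: list = field(default_factory=list)    # upstream Node objects
    # dim_map[i] = output dimension that input dimension i maps to (-1 = reduced away)
    dim_map: dict = field(default_factory=dict)
    split: Optional[SplitProperty] = None         # annotated or propagated property


def propagate(outputs):
    """Push user-annotated split properties through the graph.

    Each unannotated node inherits the split property of an annotated input,
    remapped through the node's dimension mapping, so the whole graph is
    configured from a few critical annotations.
    """
    for node in outputs:
        for inp in node.inputs:
            propagate([inp])                      # resolve inputs first
            if node.split is None and inp.split is not None:
                mapped_dim = node.dim_map.get(inp.split.dim, inp.split.dim)
                if mapped_dim >= 0:               # the split dimension survives this op
                    node.split = SplitProperty(mapped_dim, inp.split.num_shards)
    return outputs


# Usage: annotate only the input tensor, then let propagation configure the rest.
x = Node("x", "input", split=SplitProperty(dim=0, num_shards=4))  # shard the batch dim
matmul = Node("matmul", "matmul", inputs=[x], dim_map={0: 0})     # batch dim preserved
relu = Node("relu", "relu", inputs=[matmul], dim_map={0: 0})

propagate([relu])
print(relu.split)   # SplitProperty(dim=0, num_shards=4), inherited automatically
```

In a real system the propagation rules would be per-operator and the resulting multi-device graph would also insert the necessary communication, but the sketch captures the core idea: one annotation, one propagation rule, the rest of the graph set automatically.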