As deep learning models become deeper and wider, their demand for computing and memory resources also increases, posing major challenges for multi-model parallelism. In resource-limited scenarios, it is crucial to perform computation scheduling and memory optimization tailored to the characteristics of the application, so that multiple models can execute in parallel within limited memory while still meeting the application's performance requirements. To address these issues, this thesis designs a computation scheduling framework named MPInfer and a memory allocator named GAMMA. MPInfer aims to improve the utilization of computing resources and meet the performance requirements of practical applications under multi-model parallelism. GAMMA reduces memory usage during model execution, enabling multiple models to run in parallel on resource-constrained devices.

The main contributions of this thesis are as follows:

1. In terms of computation scheduling, a multi-model parallel scheduling framework, MPInfer, is designed and implemented. MPInfer supports various parallel scheduling strategies, such as pipeline parallelism and GPU parallelism, along with CPU-GPU data transfer strategies based on pageable or pinned memory. GPU parallelism includes parallelism across CUDA streams, parallelism across CUDA contexts, and parallelism across CUDA contexts enhanced by the NVIDIA Multi-Process Service (MPS parallelism); a simplified sketch of the CUDA-stream strategy is given at the end of this section.

2. In terms of memory optimization, a graph-aware predictable memory allocator, GAMMA, is proposed and implemented. GAMMA operates in two phases. In the first phase, a memory allocation plan is generated by collecting memory allocation information from the first k steps. In the second phase, memory for the subsequent steps is allocated according to the generated plan, thereby reducing memory fragmentation and overall memory usage; an illustrative sketch of this plan-based allocation also follows at the end of this section.

3. A series of experiments on MPInfer and GAMMA in specific application scenarios is performed to demonstrate their effectiveness. First, MPInfer is evaluated in an autonomous driving scenario involving multi-model parallelism. The results show that the best performance on Jetson AGX Xavier is achieved by combining parallelism across CUDA streams with the pinned-memory transfer strategy, yielding an end-to-end latency of 25.2 ms and a throughput of 58.7 FPS. Second, GAMMA is evaluated with multiple models. On the CPU side, models execute on average 1.19× faster with GAMMA than with TensorFlow, while CPU memory usage is reduced by an average of 50.42% compared to TensorFlow. On the GPU side, GAMMA reduces GPU memory usage by an average of 12.78% and a maximum of 42.60%. In particular, while TensorFlow can only support one ResNet50 or two DIN models running simultaneously, GAMMA can handle up to two ResNet50 or three DIN models running concurrently. When using the MPS parallel mode supported by MPInfer, running three DIN models in parallel yields a 17.40% increase in throughput over running only two DIN models in parallel.

In summary, GAMMA reduces the memory usage of each model, enabling more models to run in parallel on resource-constrained devices and making multi-model parallelism feasible. Building on this, MPInfer further improves the effective utilization of computing resources during multi-model parallel execution, thereby achieving efficient parallelism of multiple models.
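
The following minimal sketch (not MPInfer's actual code) illustrates the CUDA-stream parallelism and pinned-memory transfer strategy referenced in contribution 1: each model gets its own CUDA stream and a page-locked host buffer, so host-to-device copies for different models can overlap with each other and with queued kernel work. The model count, buffer size, and the placeholder `enqueue_inference` call are assumptions for illustration only.

```cpp
// Sketch: per-model CUDA streams plus pinned host buffers (hypothetical setup).
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call)                                                         \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));   \
            return 1;                                                       \
        }                                                                   \
    } while (0)

int main() {
    const int kNumModels = 2;            // hypothetical: two models in parallel
    const size_t kInputBytes = 1 << 20;  // hypothetical per-model input size

    cudaStream_t streams[kNumModels];
    float* h_pinned[kNumModels];  // pinned (page-locked) host input buffers
    float* d_input[kNumModels];   // per-model device input buffers

    for (int i = 0; i < kNumModels; ++i) {
        CHECK(cudaStreamCreate(&streams[i]));
        CHECK(cudaMallocHost((void**)&h_pinned[i], kInputBytes));  // pinned memory
        CHECK(cudaMalloc((void**)&d_input[i], kInputBytes));
    }

    // Issue copies asynchronously; with pinned memory they can overlap with
    // work already queued on the other model's stream. Each model's inference
    // (e.g. a per-model engine) would be enqueued on streams[i] right after.
    for (int i = 0; i < kNumModels; ++i) {
        CHECK(cudaMemcpyAsync(d_input[i], h_pinned[i], kInputBytes,
                              cudaMemcpyHostToDevice, streams[i]));
        // enqueue_inference(d_input[i], streams[i]);  // placeholder, hypothetical
    }

    for (int i = 0; i < kNumModels; ++i) {
        CHECK(cudaStreamSynchronize(streams[i]));
        CHECK(cudaFree(d_input[i]));
        CHECK(cudaFreeHost(h_pinned[i]));
        CHECK(cudaStreamDestroy(streams[i]));
    }
    return 0;
}
```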
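
The next sketch (a simplification, not the actual GAMMA implementation) illustrates the two-phase, plan-based allocation idea from contribution 2: phase one records the allocation sizes requested during the first k profiling steps, and phase two serves later steps from fixed offsets inside a single pre-allocated arena, avoiding per-step allocation and fragmentation. The class name `PlanBasedAllocator`, the tensor ids, the sizes, and the 256-byte alignment are illustrative assumptions.

```cpp
// Sketch: two-phase plan-based allocation (hypothetical, simplified).
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

class PlanBasedAllocator {
public:
    // Phase 1: record the maximum size observed for each tensor id.
    void record(int tensor_id, size_t bytes) {
        size_t& cur = observed_[tensor_id];
        if (bytes > cur) cur = bytes;
    }

    // Freeze the plan: assign aligned offsets and allocate one arena
    // large enough for every recorded tensor.
    void build_plan() {
        size_t offset = 0;
        for (const auto& entry : observed_) {
            offsets_[entry.first] = offset;
            offset += align_up(entry.second, 256);
        }
        arena_.resize(offset);
    }

    // Phase 2: allocation is just pointer arithmetic into the arena.
    void* allocate(int tensor_id) {
        return arena_.data() + offsets_.at(tensor_id);
    }

private:
    static size_t align_up(size_t n, size_t a) { return (n + a - 1) / a * a; }
    std::unordered_map<int, size_t> observed_;
    std::unordered_map<int, size_t> offsets_;
    std::vector<uint8_t> arena_;
};

int main() {
    PlanBasedAllocator alloc;
    // Phase 1: profile the first k steps (here: two tensors of a toy graph).
    alloc.record(/*tensor_id=*/0, 1 << 20);  // 1 MiB activation buffer
    alloc.record(/*tensor_id=*/1, 4 << 20);  // 4 MiB weight buffer
    alloc.build_plan();
    // Phase 2: subsequent steps reuse the fixed offsets from the plan.
    void* act = alloc.allocate(0);
    void* wgt = alloc.allocate(1);
    std::cout << "activation at " << act << ", weights at " << wgt << "\n";
    return 0;
}
```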