Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects

Posted on: 2015-06-09
Degree: Ph.D.
Type: Dissertation
University: The Ohio State University
Candidate: Potluri, Sreeram
Subject: Computer Science
GTID: 1479390020952286
Abstract:
Accelerators (such as NVIDIA GPUs) and coprocessors (such as Intel MIC/Xeon Phi) are fueling the growth of next-generation ultra-scale systems with high compute density and high performance per watt. Application developers also use a hierarchy of programming models to extract maximum performance from these heterogeneous systems.

Overlapping computation and communication has been a critical requirement for applications to achieve peak performance on large-scale systems. Communication overheads have a magnified impact on heterogeneous clusters because their higher compute density means more compute power is wasted while waiting on communication. Modern interconnects like InfiniBand, with their Remote DMA capabilities, enable asynchronous progress of communication, freeing up the cores to do useful computation. MPI and PGAS models offer light-weight, one-sided communication primitives that minimize process synchronization overheads and enable better computation and communication overlap.

This dissertation targets several of these challenges in programming GPU and Intel MIC clusters. Our work with MVAPICH2-GPU enabled the use of MPI in a unified manner for communication from host and GPU device memories. It takes advantage of unified virtual addressing (UVA) provided by CUDA. We proposed designs in the MVAPICH2-GPU runtime that significantly improve the performance of internode and intranode GPU-GPU communication by pipelining and overlapping memory, PCIe, and network transfers. We take advantage of CUDA features such as IPC, GPUDirect RDMA, and CUDA kernels to further reduce communication overheads. MVAPICH2-GPU improves programmability by removing the need for developers to use both CUDA and MPI for GPU-GPU communication, while improving performance through runtime-level optimizations that are transparent to the user. We have shown up to 69% and 45% improvement in point-to-point latency for data movement for 4-byte and 4 MB messages, respectively. Likewise, the solutions improve bandwidth by 2x and 56% for 4 KB and 64 KB messages, respectively. Our work has been released as part of the MVAPICH2 packages, making it the first MPI library to support direct GPU-GPU communication. It is currently deployed and used on several large GPU clusters across the world, including Tsubame 2.0 and Keeneland.

We proposed novel extensions to the OpenSHMEM PGAS model that enable unified communication from host and GPU memories. We present designs for optimized internode and intranode one-sided communication on GPU clusters, using asynchronous threads and DMA-based techniques. The proposed extensions, coupled with an efficient runtime, improve the latency of a 4-byte shmem_getmem operation by 90%, 40%, and 17% for intra-IOH, inter-IOH, and inter-node GPU configurations with CUDA, respectively. They improve the performance of Stencil2D and BFS kernels by 65% and 12% on clusters of 192 and 96 GPUs, respectively.

Through MVAPICH2-MIC, we proposed designs for an efficient MPI runtime on clusters with Intel Xeon Phi coprocessors. These designs improve the performance of intra-MIC, intra-node, and inter-node communication on various cluster configurations, while hiding the system complexity from the user. Our designs take advantage of SCIF, Intel's low-level communication API, in addition to standard communication channels like shared memory and IB verbs, to offer substantial gains in the performance of the MVAPICH2 MPI library. PRISM, a proxy-based multi-channel design in MVAPICH2-MIC, allows applications to overcome the performance bottlenecks imposed by state-of-the-art processor architectures and extract the full compute potential of the MIC coprocessors. The proposed designs deliver up to 70% improvement in point-to-point latency and more than 6x improvement in peak uni-directional bandwidth from the Xeon Phi to the host. Using PRISM, we improve inter-node latency between MICs by up to 65% and bandwidth by up to 5x. PRISM improves the performance of the MPI_Alltoall operation by up to 65% with 256 processes. It improves the performance of a 3D Stencil communication kernel and the P3DFFT library by 56% and 22% with 1024 and 512 processes, respectively.

We have shown the potential benefits of using MPI one-sided communication semantics for overlapping computation and communication in a real-world seismic modeling application, AWP-ODC, with a 12% improvement in overall application performance on 4,096 cores. This effort was also part of the application's entry as a Gordon Bell finalist at SC'2010. We demonstrated the potential performance benefits of using one-sided communication semantics on GPU clusters. We presented an efficient design for the MPI-3 RMA model on NVIDIA GPU clusters with GPUDirect RDMA and proposed minor extensions to the model that can further reduce synchronization overheads. The proposed extension to the RMA model enables an inter-node ping-pong latency of 2.78 usec between GPUs, a 60% improvement over the latency offered by send/recv operations. One-sided communication provides 2x the message rate achieved using MPI Send/Recv operations. One-sided semantics improve the latency of a 3DStencil communication kernel by up to 27%. (Abstract shortened by UMI.)
Keywords: MPI, Performance, Communication, GPU, Clusters, PGAS, Latency, Improve