
Research on Key Techniques of Application-Oriented GPU Parallel Computing

Posted on: 2015-11-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H Y Su    Full Text: PDF
GTID: 1108330509461078    Subject: Computer Science and Technology

Abstract/Summary:
With the progress of information technology, the computational requirements of large-scale scientific and engineering applications keep increasing. Due to power constraints, over the past ten years the growth in computing power has come mainly from multi-/many-core techniques rather than from higher processor frequencies. Large-scale systems built from general-purpose multi-core CPUs still face the energy problem. Under this circumstance, accelerator technology has attracted much attention from academic and industrial researchers. In particular, owing to its high energy-efficiency ratio, the GPU plays an important role in high performance computing, and CPU-GPU heterogeneous computing has become an important trend in the HPC domain, as exemplified by the Titan supercomputer, which is based on NVIDIA's Kepler GPU. However, GPU architecture and programming differ from those of traditional CPUs, GPU parallel software technology remains relatively immature, and the achieved application efficiency and programming productivity are often unsatisfactory. In this work, driven by concrete applications, we study the key technologies of parallel computing on GPU and CPU-GPU heterogeneous systems, including parallel programming, programming models, GPU performance modeling and performance portability. Our contributions are as follows:

1. After carefully reviewing and profiling the program, we propose a fully parallel framework for the H.264 encoder on GPU. We introduce a loop-partition technique to divide the whole pipeline into four frame-level stages (motion estimation, intra coding, CAVLC, deblocking filter). All four components are offloaded to the GPU; the CPU is responsible only for simple transactions such as I/O. The framework exploits the producer-consumer locality between different parts of the encoder, which avoids unnecessary data copies between CPU and GPU. We implemented the whole H.264 encoder in CUDA. For the compute-intensive motion estimation component, we propose a scalable parallel algorithm targeting massively parallel architectures, named multi-resolution multi-window (MRMW) motion estimation, which computes the optimal motion vector (MV) for each macroblock (MB) in several steps (a sketch of such a per-macroblock search appears after this list). To overcome the limitations of the irregular components, we propose a direction-priority deblocking filter and a component-based parallel CAVLC scheme. These parallel methods not only improve the performance of the coding tools but also reduce the data transferred between host and device. Based on multi-slice technology, a multilevel parallel method is designed for intra coding to exploit as much parallelism as possible. Our implementation satisfies the requirement of real-time HD encoding at 30 fps, while the PSNR loss ranges only from 0.14 to 0.77 dB.

2. We propose and implement an efficient sedimentary basin simulation solver for GPU-enhanced clusters. We present three parallel versions of the dual-lithology fully explicit sedimentation simulator: an MPI-based CPU-only version, an MPI+CUDA GPU-only version, and an MPI+OpenMP+CUDA CPU-GPU hybrid version. We exploit massive parallelism and optimize the use of on-chip memory to utilize NVIDIA GPUs effectively. In addition, an overlap technique is proposed to balance computation and communication (see the stream-based sketch after this list).
Results show that our CPU-GPU hybrid implementation can handle a global mesh resolution of 131072 × 131072 and achieves a double-precision performance of 72.8 TFlops using 1024 GPUs and 12288 CPU cores on the Tianhe-1A supercomputer.

3. We propose an analytical model that estimates the performance of stencil computations on GPU from the angle of data traffic. To understand and predict the GPU performance of stencil computations and to identify their performance bottlenecks, we first quantify the performance of stencil computations on GPU; specifically, we try to answer how and why different optimizations boost performance. Based on this analysis, we propose an analytical model that estimates the data traffic volume of a stencil program on GPU, and thereby the overall execution time. Three stages of data traffic are studied: (1) between registers and on-SMX storage, (2) between on-SMX storage and the L2 cache, and (3) between the L2 cache and the GPU's device memory. Three associated granularities are used: a CUDA thread, a thread block, and a set of simultaneously active thread blocks. Numerical experiments with four 3D stencil computations verify the accuracy of the quantified data traffic volumes. Moreover, by introducing an imbalance coefficient together with known realistic memory bandwidths, we can predict the time usage from the quantified data traffic volumes (an illustrative form of such a prediction appears after this list). For the four 3D stencils, the average error of the time prediction is 6.9% for a baseline implementation and 9.5% for a blocking implementation.

4. Although OpenCL provides full code portability between different hardware platforms, performance portability can be far from satisfactory. In this work, we use a set of representative 3D stencil computations to study OpenCL's code and performance portability between CPUs and GPUs. For each stencil computation, we devise different implementations of the computational kernel function, all 100% code-portable between the two architectures. The most straightforward and compact implementation is, unsurprisingly, the least performance-portable, because such an implementation may hamper effective use of the hardware. By injecting code complexity into the involved loop nests, we can create kernel functions that retain full code portability but show increased performance portability (a sketch of such a kernel follows this list). We find that appropriate data blocking, implicit vectorization and register reuse are important factors for achieving performance. We also compare against OpenMP and CUDA implementations of the same stencil computations. The most GPU-oriented OpenCL implementation, while completely code-portable, achieves on average 96% of the corresponding OpenMP performance on three Intel CPUs and 96.8% of the corresponding CUDA performance on NVIDIA's Kepler GK110 GPU.
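To illustrate the per-macroblock parallelism of item 1, the following is a minimal CUDA sketch, not the dissertation's actual code, of a single-resolution, single-window SAD search: one thread block per 16×16 macroblock, one thread per candidate motion vector. All names (sad_search, the radius R) are assumptions for the example; the MRMW algorithm additionally iterates over multiple resolutions and search windows.

    // Hypothetical sketch: exhaustive SAD search inside one search window.
    // Launch as: sad_search<<<num_macroblocks, 256>>>(...).
    #include <cstdint>
    #include <climits>

    #define MB 16   // macroblock edge length
    #define R  7    // search radius; (2R+1)^2 = 225 candidates <= 256 threads

    __global__ void sad_search(const uint8_t* cur, const uint8_t* ref,
                               int width, int height, int mbs_per_row,
                               int2* best_mv)
    {
        __shared__ int sads[256];
        __shared__ int idx [256];

        int mb  = blockIdx.x;                 // one block per macroblock
        int mbx = (mb % mbs_per_row) * MB;
        int mby = (mb / mbs_per_row) * MB;

        int t  = threadIdx.x;
        int dx = t % (2 * R + 1) - R;         // candidate MV of this thread
        int dy = t / (2 * R + 1) - R;

        int sad = INT_MAX;
        if (t < (2 * R + 1) * (2 * R + 1)) {
            sad = 0;
            for (int y = 0; y < MB; ++y)
                for (int x = 0; x < MB; ++x) {
                    int cx = mbx + x, cy = mby + y;
                    int rx = min(max(cx + dx, 0), width  - 1);  // clamp at border
                    int ry = min(max(cy + dy, 0), height - 1);
                    sad += abs((int)cur[cy * width + cx] - (int)ref[ry * width + rx]);
                }
        }
        sads[t] = sad; idx[t] = t;
        __syncthreads();

        // tree reduction: find the minimum SAD and its candidate index
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s && sads[t + s] < sads[t]) { sads[t] = sads[t + s]; idx[t] = idx[t + s]; }
            __syncthreads();
        }
        if (t == 0)
            best_mv[mb] = make_int2(idx[0] % (2 * R + 1) - R, idx[0] / (2 * R + 1) - R);
    }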
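The abstract does not detail the overlap technique of item 2; the sketch below shows one common pattern, assumed here, for balancing computation and communication with two CUDA streams: update the boundary layers first, copy them to the host asynchronously, and run the interior update concurrently with the MPI halo exchange. The kernel names, the 1-D row decomposition (ghost rows 0 and ny-1) and the buffer layout are all illustrative.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // placeholder kernels; real stencil bodies omitted in this sketch
    __global__ void halo_step(double* u, int nx, int ny)     { /* rows 1 and ny-2 */ }
    __global__ void interior_step(double* u, int nx, int ny) { /* rows 2 .. ny-3 */ }

    // h_send/h_recv should be pinned (cudaMallocHost) so the async copies
    // really overlap; up/down may be MPI_PROC_NULL at the domain ends.
    void timestep(double* d_u, double* h_send, double* h_recv,
                  int nx, int ny, int up, int down,
                  cudaStream_t s_halo, cudaStream_t s_inner, MPI_Comm comm)
    {
        // 1. boundary rows first, in their own stream
        halo_step<<<(nx + 255) / 256, 256, 0, s_halo>>>(d_u, nx, ny);
        cudaMemcpyAsync(h_send,      d_u + 1L * nx,        nx * sizeof(double),
                        cudaMemcpyDeviceToHost, s_halo);   // bottom boundary row
        cudaMemcpyAsync(h_send + nx, d_u + (ny - 2L) * nx, nx * sizeof(double),
                        cudaMemcpyDeviceToHost, s_halo);   // top boundary row

        // 2. interior update overlaps the transfer and the MPI exchange
        interior_step<<<(nx * (ny - 4) + 255) / 256, 256, 0, s_inner>>>(d_u, nx, ny);

        // 3. exchange halos once the boundary data has reached the host
        cudaStreamSynchronize(s_halo);
        MPI_Sendrecv(h_send,      nx, MPI_DOUBLE, down, 0,
                     h_recv,      nx, MPI_DOUBLE, up,   0, comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(h_send + nx, nx, MPI_DOUBLE, up,   1,
                     h_recv + nx, nx, MPI_DOUBLE, down, 1, comm, MPI_STATUS_IGNORE);

        // 4. push received ghost layers back to the device and join streams
        cudaMemcpyAsync(d_u + (ny - 1L) * nx, h_recv,      nx * sizeof(double),
                        cudaMemcpyHostToDevice, s_halo);   // from up neighbor
        cudaMemcpyAsync(d_u,                  h_recv + nx, nx * sizeof(double),
                        cudaMemcpyHostToDevice, s_halo);   // from down neighbor
        cudaStreamSynchronize(s_halo);
        cudaStreamSynchronize(s_inner);
    }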
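The abstract does not spell out the prediction formula of item 3. The following is only an illustrative reconstruction, consistent with the ingredients it names: quantified traffic volumes V_i at the three memory levels, measured realistic bandwidths B_i, and an imbalance coefficient gamma >= 1 that accounts for uneven load across the memory hierarchy:

    T_{\text{pred}} \;\approx\; \gamma \cdot \max_{i \in \{\text{reg--SMX},\; \text{SMX--L2},\; \text{L2--DRAM}\}} \frac{V_i}{B_i}

That is, in this reading the slowest memory level, scaled by the imbalance coefficient, determines the predicted execution time.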
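As an example of the "injected code complexity" of item 4, the CUDA sketch below (an assumed, generic 7-point stencil, not taken from the dissertation) blocks the x-y plane across threads and sweeps the z dimension in registers, so three z-levels of the center column are reused instead of re-read from memory; the same structure maps to an OpenCL kernel almost verbatim.

    // Illustrative 7-point 3D stencil with a rolling register window in z.
    // Launch as, e.g.: stencil7<<<dim3((nx+31)/32, (ny+7)/8), dim3(32, 8)>>>(...).
    __global__ void stencil7(const double* __restrict__ in,
                             double* __restrict__ out,
                             int nx, int ny, int nz)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index
        int j = blockIdx.y * blockDim.y + threadIdx.y;   // y index
        if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1) return;

        long stride = (long)nx * ny;
        long idx = (long)j * nx + i + stride;            // start at z-level k = 1

        // three consecutive z-levels of this column, held in registers
        double below = in[idx - stride];
        double mid   = in[idx];
        double above = in[idx + stride];

        for (int k = 1; k < nz - 1; ++k, idx += stride) {
            out[idx] = 0.4 * mid
                     + 0.1 * (below + above
                            + in[idx - 1]  + in[idx + 1]
                            + in[idx - nx] + in[idx + nx]);
            below = mid;                                  // shift the window up
            mid   = above;
            if (k + 2 < nz) above = in[idx + 2 * stride];
        }
    }

Each thread owns one (i, j) column, which is what makes the register reuse along z possible; the analogous OpenCL version trades the CUDA built-in indices for get_global_id(0)/get_global_id(1) but is otherwise unchanged.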
Keywords/Search Tags: GPU, Video Coding, CPU-GPU Hybrid Computing, Stencil Computation, Performance Modeling, OpenCL, Performance Portability