
Key Technology Research On Implementation And Optimization Of OpenACC On MIC

Posted on: 2014-05-07
Degree: Master
Type: Thesis
Country: China
Candidate: C Chen
Full Text: PDF
GTID: 2308330479979448
Subject: Computer Science and Technology
Abstract/Summary:
Heterogeneous systems that combine high GFLOPS with a relatively low power footprint have become a focus of current research. Following the success of GPUs in general-purpose computing, Intel unveiled its Many Integrated Core (MIC) coprocessor, which is based on Intel Architecture (IA). Xeon Phi, the second generation of MIC product, has been successfully deployed in the Milky Way-2 (Tianhe-2) supercomputer. However, the difficulty of programming and optimization limits the adoption of heterogeneous systems. In recent years, pragma-based programming interfaces such as the OpenACC standard have been studied frequently. Up to now, there has been no implementation of OpenACC on MIC.

In this thesis, we use OpenACC to program the new Intel MIC coprocessor by automatically translating OpenACC source code into Intel Offload (Offload for short) source code. We then optimize the translated code according to the OpenACC specification and the MIC architecture. Our main contributions are as follows:

1. We propose a translation model that maps OpenACC to Offload. We analyze the essential factors of directive-based heterogeneous programming: task management, parallelism description, and data management. We then propose source-to-source translation methods based on how OpenACC and Offload correspond on these factors. Because the OpenACC specification and the MIC architecture diverge, we carefully study the efficiency of the translation policy.
2. We propose optimization methods guided by that efficiency analysis: task repartitioning, vectorization based on the 512-bit SIMD units of MIC, and a tree barrier synchronization algorithm built on SIMD intrinsics.
3. We implement a compiler framework that automatically translates OpenACC source code into Offload source code. It adopts a layered design: the front-end module translates OpenACC source code into an intermediate representation (IR), the middle-end module optimizes it, and the back-end module outputs Offload source code.
4. We study two kernels, matrix multiplication and Jacobi, on a MIC-based platform (one Xeon Phi card) and a GPU-based platform (one NVIDIA Tesla K20c card) to analyze the mapping and translation efficiency. Performance evaluation shows that both kernels run approximately 3 times faster on one Xeon Phi card than on one eight-core Intel Xeon E5-2670 CPU. Moreover, both kernels perform better on the MIC-based platform than on the GPU-based one.
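
To make the OpenACC-to-Offload mapping concrete, here is a small hand-written sketch for the matrix multiplication kernel. It is not output of the thesis's translator: the function names `matmul_acc` and `matmul_offload`, the row-major layout, and the particular clause choices are assumptions made for illustration. The three factors named in the abstract show up as task management (the OpenACC `parallel` construct versus `#pragma offload target(mic)`), parallelism description (`loop` versus OpenMP worksharing plus `simd`), and data management (`copyin`/`copyout` versus `in`/`out` with `length`).

```c
#include <assert.h>

/* OpenACC version: one directive carries the task, the parallelism,
 * and the data movement. Names and layout are illustrative assumptions. */
void matmul_acc(int n, const float *a, const float *b, float *c)
{
    assert(n > 0);
    #pragma acc parallel loop copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    for (int i = 0; i < n; i++) {
        #pragma acc loop
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Hand-written Offload counterpart (an assumed mapping, not the
 * thesis's generated code):
 *   acc parallel     -> offload target(mic)   (task management)
 *   acc loop         -> omp parallel for/simd (parallelism description)
 *   copyin/copyout   -> in/out with length()  (data management)        */
void matmul_offload(int n, const float *a, const float *b, float *c)
{
    assert(n > 0);
    #pragma offload target(mic) in(a : length(n * n)) in(b : length(n * n)) out(c : length(n * n))
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            /* Ask for vectorization of the dot product. */
            #pragma omp simd reduction(+ : sum)
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

A compiler without OpenACC or Intel Offload support ignores the unknown pragmas and runs both functions serially on the host, so the two versions can be checked against each other before moving to a coprocessor.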
Keywords/Search Tags: Heterogeneous System, GPU, OpenACC, MIC, Intel Offload, Barrier Synchronization
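
The abstract does not give the details of its SIMD-intrinsic tree barrier, so the following is only a generic software sketch of the underlying idea: a sense-reversing combining-tree barrier written with portable C11 atomics instead of MIC intrinsics. All names (`tb_node`, `tb_barrier`, `tb_arrive`, `tb_wait`) are hypothetical. Threads arrive at a leaf; the last arrival at each node resets it and carries the arrival up to the parent, and the thread that completes the root flips a shared sense flag that releases everyone else.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct tb_node {
    atomic_int arrived;      /* arrivals seen in the current episode */
    int fanin;               /* arrivals expected at this node */
    struct tb_node *parent;  /* NULL at the root */
} tb_node;

typedef struct {
    atomic_int sense;        /* flipped once every participant has arrived */
} tb_barrier;

void tb_node_init(tb_node *n, int fanin, tb_node *parent)
{
    assert(fanin > 0);
    atomic_init(&n->arrived, 0);
    n->fanin = fanin;
    n->parent = parent;
}

/* Register one arrival at `leaf`. Returns 1 if this call completed the
 * whole barrier (and published `my_sense`), 0 otherwise. */
int tb_arrive(tb_barrier *b, tb_node *leaf, int my_sense)
{
    for (tb_node *n = leaf; n != NULL; n = n->parent) {
        if (atomic_fetch_add(&n->arrived, 1) + 1 < n->fanin)
            return 0;                  /* a later arrival carries this subtree up */
        atomic_store(&n->arrived, 0);  /* last arrival: reset the node for reuse */
    }
    atomic_store(&b->sense, my_sense); /* root completed: release the waiters */
    return 1;
}

/* Blocking wait: flip the caller's private sense, arrive, then spin
 * until the shared sense matches. */
void tb_wait(tb_barrier *b, tb_node *leaf, int *local_sense)
{
    *local_sense = !*local_sense;
    if (!tb_arrive(b, leaf, *local_sense)) {
        while (atomic_load(&b->sense) != *local_sense) {
            /* spin */
        }
    }
}
```

Each thread keeps its own `local_sense` (initially 0, matching the initial `sense`) and calls `tb_wait` with the leaf it is assigned to. Because a node is reset before the sense flag flips, the same tree can be reused for the next barrier episode. Grouping threads that share a core under one leaf keeps most of the contention local, which is the usual motivation for a tree over a single shared counter.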