The advent of the multi-core/many-core paradigm has provided unprecedented computing power, and it is of great significance to develop parallelization frameworks that allow scientific applications to harvest this power. However, designing an efficient parallelization framework that continues to scale on future architectures is a great challenge, owing to the complexity of real-world applications and the variety of multi-core/many-core platforms.

To address this challenge, we propose a hierarchical optimization framework that maps applications to hardware by exploiting multiple levels of parallelism: (1) Inter-node parallelism via spatial decomposition; (2) inter-core parallelism via cellular decomposition; and (3) single-instruction multiple-data (SIMD) parallelization. The framework includes application-based SIMD analysis and optimization, which allows application scientists to determine whether their applications are viable for SIMDization, and provides various code-transformation techniques to enhance SIMD efficiency as well as simple recipes for when compiler auto-vectorization fails.
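As a minimal illustration of the kind of code transformation involved (the layout and names here are our own sketch, not the dissertation's code), converting an array-of-structures particle layout into a structure-of-arrays layout turns a strided inner loop into a unit-stride stream that compilers can auto-vectorize with SIMD:

```c
#include <stddef.h>

#define N 1024

/* AoS: x, y, z interleaved in memory -> strided access per
   coordinate, which typically defeats auto-vectorization. */
typedef struct { float x, y, z; } particle_aos;

/* SoA: each coordinate stored contiguously -> unit stride,
   SIMD friendly. */
typedef struct { float x[N], y[N], z[N]; } particles_soa;

/* Scale all x coordinates; with the SoA layout this loop maps
   each iteration onto a SIMD lane. */
void scale_x(particles_soa *p, float s) {
    for (size_t i = 0; i < N; ++i)
        p->x[i] *= s;   /* unit-stride access vectorizes cleanly */
}
```

Whether a compiler actually vectorizes such a loop depends on the target and flags; when it does not, the dissertation's "simple recipes" refer to manual interventions of this kind.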
We also propose a suite of optimization strategies to achieve ideal on-chip inter-core strong scalability on emerging many-core architectures: (1) A divide-and-conquer algorithm adaptive to local memory; (2) a novel data layout to improve data locality; (3) on-chip locality-aware parallel algorithms to enhance data reuse; and (4) a pipeline algorithm that uses a data transfer agent to orchestrate computation and memory operations, hiding the latency of shared-memory accesses.

We have applied the framework to three scientific applications, which represent most of the numerical classes in the seven dwarfs (known to cover most high-performance computing applications): (1) Stencil computation, specifically the lattice Boltzmann method (LBM) for fluid-flow simulation; (2) molecular dynamics (MD) simulation; and (3) molecular fragment analysis via connected-component detection.

We have achieved high inter-node, inter-core (multithreading), and SIMD efficiency on various computing platforms: (1) For LBM, inter-node parallel efficiency of 0.978 on 131,072 BlueGene/P processors, multithreading efficiency of 0.882 on the 6 cores of a Cell BE, and SIMD efficiency of 0.780 using the 4-element vector registers of a Cell BE; (2) for MD simulation, inter-node parallel efficiency of 0.985 on 106,496 BlueGene/L processors, and inter-core multithreading parallel efficiency of 0.99 on the 64-core Godson-T many-core architecture; (3) for molecular fragment analysis, nearly linear inter-node strong scalability for molecular graphs with up to 50 million vertices on 32 computing nodes, and over 13-fold inter-core speedup on 16 cores. In addition, a simple performance model based on hierarchical parallelization is derived, which suggests that the optimization scheme is likely to scale well toward the exascale.
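To make the fragment-analysis application concrete, the serial core of connected-component detection can be sketched with a union-find structure (our own illustrative sketch, not the dissertation's parallel algorithm): atoms are vertices, bonds are edges, and each resulting component is one molecular fragment.

```c
#include <stddef.h>

#define MAX_ATOMS 1024
static int parent[MAX_ATOMS];   /* parent[v] == v marks a root */

/* Each atom starts as its own fragment. */
void uf_init(int n) {
    for (int i = 0; i < n; ++i) parent[i] = i;
}

/* Find the fragment root, with path halving for near-O(1) finds. */
int uf_find(int v) {
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];
        v = parent[v];
    }
    return v;
}

/* A bond between atoms a and b merges their fragments. */
void uf_union(int a, int b) {
    parent[uf_find(a)] = uf_find(b);
}

/* After all bonds are processed, each remaining root is a fragment. */
int fragment_count(int n) {
    int c = 0;
    for (int i = 0; i < n; ++i)
        if (uf_find(i) == i) ++c;
    return c;
}
```

The parallel versions evaluated in this work distribute the vertex set across nodes and cores; the sketch above only shows the per-partition logic.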
Furthermore, we have analyzed the impact of architectural features on application performance, finding that certain architectural features are essential for these optimizations.

This research not only suggests viable optimization techniques for a broad range of scientific applications on future many-core parallel supercomputing platforms, but also provides guidance for the effective architectural design of future supercomputing systems.