The advent of the multi-core/many-core paradigm has provided unprecedented computing power, and it is of great significance to develop parallelization frameworks that allow scientific applications to harvest this power. However, designing an efficient parallelization framework that continues to scale on future architectures is a great challenge, owing to the complexity of real-world applications and the variety of multi-core/many-core platforms.

To address this challenge, we propose a hierarchical optimization framework that maps applications to hardware by exploiting multiple levels of parallelism: (1) Inter-node parallelism via spatial decomposition; (2) inter-core parallelism via cellular decomposition; and (3) single-instruction multiple-data (SIMD) parallelization. The framework includes application-based SIMD analysis and optimization, which allows application scientists to determine whether their applications are viable for SIMDization, and provides various code-transformation techniques to enhance SIMD efficiency as well as simple recipes for when compiler auto-vectorization fails.
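As a minimal illustration of the kind of code transformation involved (the layout and names here are our own sketch, not the dissertation's code), converting an array-of-structures particle layout into a structure-of-arrays layout turns a strided inner loop into a unit-stride stream that compilers can auto-vectorize with SIMD:

```c
#include <stddef.h>

#define N 1024

/* AoS: x, y, z interleaved in memory -> strided access per
   coordinate, which typically defeats auto-vectorization. */
typedef struct { float x, y, z; } particle_aos;

/* SoA: each coordinate stored contiguously -> unit stride,
   SIMD friendly. */
typedef struct { float x[N], y[N], z[N]; } particles_soa;

/* Scale all x coordinates; with the SoA layout this loop maps
   each iteration onto a SIMD lane. */
void scale_x(particles_soa *p, float s) {
    for (size_t i = 0; i < N; ++i)
        p->x[i] *= s;   /* unit-stride access vectorizes cleanly */
}
```

Whether a compiler actually vectorizes such a loop depends on the target and flags; when it does not, the dissertation's "simple recipes" refer to manual interventions of this kind.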
We also propose a suite of optimization strategies to achieve ideal on-chip inter-core strong scalability on emerging many-core architectures: (1) A divide-and-conquer algorithm adaptive to local memory; (2) a novel data layout to improve data locality; (3) on-chip locality-aware parallel algorithms to enhance data reuse; and (4) a pipeline algorithm that uses a data transfer agent to orchestrate computation and memory operations, hiding the latency of shared-memory accesses.

We have applied the framework to three scientific applications, which represent most of the numerical classes in the seven dwarfs (known to cover most high-performance computing applications): (1) Stencil computation, specifically the lattice Boltzmann method (LBM) for fluid-flow simulation; (2) molecular dynamics (MD) simulation; and (3) molecular fragment analysis via connected-component detection.

We have achieved high inter-node, inter-core (multithreading), and SIMD efficiency on various computing platforms: (1) For LBM, inter-node parallel efficiency of 0.978 on 131,072 BlueGene/P processors, multithreading efficiency of 0.882 on the 6 cores of a Cell BE, and SIMD efficiency of 0.780 using the 4-element vector registers of a Cell BE; (2) for MD simulation, inter-node parallel efficiency of 0.985 on 106,496 BlueGene/L processors, and inter-core multithreading parallel efficiency of 0.99 on the 64-core Godson-T many-core architecture; (3) for molecular fragment analysis, nearly linear inter-node strong scalability for molecular graphs with up to 50 million vertices on 32 computing nodes, and over 13-fold inter-core speedup on 16 cores. In addition, a simple performance model based on hierarchical parallelization is derived, which suggests that the optimization scheme is likely to scale well toward the exascale.
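To make the fragment-analysis application concrete, the serial core of connected-component detection can be sketched with a union-find structure (our own illustrative sketch, not the dissertation's parallel algorithm): atoms are vertices, bonds are edges, and each resulting component is one molecular fragment.

```c
#include <stddef.h>

#define MAX_ATOMS 1024
static int parent[MAX_ATOMS];   /* parent[v] == v marks a root */

/* Each atom starts as its own fragment. */
void uf_init(int n) {
    for (int i = 0; i < n; ++i) parent[i] = i;
}

/* Find the fragment root, with path halving for near-O(1) finds. */
int uf_find(int v) {
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];
        v = parent[v];
    }
    return v;
}

/* A bond between atoms a and b merges their fragments. */
void uf_union(int a, int b) {
    parent[uf_find(a)] = uf_find(b);
}

/* After all bonds are processed, each remaining root is a fragment. */
int fragment_count(int n) {
    int c = 0;
    for (int i = 0; i < n; ++i)
        if (uf_find(i) == i) ++c;
    return c;
}
```

The parallel versions evaluated in this work distribute the vertex set across nodes and cores; the sketch above only shows the per-partition logic.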
Furthermore, we have analyzed the impact of architectural features on application performance, finding that certain architectural features are essential for these optimizations.

This research not only suggests viable optimization techniques for a broad range of scientific applications on future many-core parallel supercomputing platforms, but also provides guidance for the effective architectural design of future supercomputing systems.