Performance Optimization Research For Sparse Numerical Kernels On Sunway Architecture

Posted on:2019-08-01

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X L Wang

Full Text:PDF

GTID:1360330590451798

Subject:Computer Science and Technology

Abstract/Summary:

Computer numerical simulation is widely recognized as a driving force behind the rapid development of both science and industry. In today's high-end systems, many-core architectures have become the most promising component; for example, the current Top 5 supercomputers all adopt many-core processors. Because current and upcoming many-core architectures offer increasingly fine-grained parallelism and longer vector lengths, thread-level and instruction-level parallel algorithms are urgently needed but have not been well addressed. Meanwhile, solving large-scale linear systems with iterative methods is widely used in numerical simulations and is regarded as one of their most time-consuming components. During the solving process, sparse matrix-vector multiplication (SpMV) and the preconditioner are the most critical kernels, yet many issues hinder their performance on many-core architectures: poor data locality, write conflicts, load imbalance, computational dependencies, lack of vectorization opportunities, frequent cache checking, and fine-grained memory access.

This research focuses on developing high-performance algorithms that exploit thread-level and instruction-level parallelism for sparse matrix-vector multiplication and for two important preconditioning kernels, the sparse triangular solve and the tridiagonal solve, on the first Chinese home-grown many-core processor, Sunway 26010. The contributions of this research include:

• Propose a parallel sparse matrix-vector multiplication algorithm for the Sunway architecture. It resolves the issues of poor data locality, write conflicts, load imbalance, frequent cache checking, and fine-grained memory access by dividing the sparse matrix into multiple regular blocks and distributing these blocks evenly across all the cores. Evaluated on all 2710 benchmarks from the Florida Matrix Collection, the proposed method achieves an average speedup of 11.7 and a best speedup of 55.0 over the sequential method on the management processing element.

• Propose a parallel sparse triangular solve for the Sunway architecture by introducing the Sparse Level Tile layout and the Producer-Consumer pairing method. The former resolves the issues of poor data locality, computational dependencies, frequent cache checking, and fine-grained memory access; the latter addresses how to carry out irregular computation and synchronization using the regular register communication. Evaluated on all 2057 square benchmarks from the Florida Matrix Collection, the proposed method achieves an average speedup of 7.8 and a best speedup of 117.3 over the sequential method on the management processing element. Compared with the state-of-the-art methods for Intel KNC and NVIDIA GPU, each running on its own platform, the proposed method obtains the best performance on 1624 of the 2057 benchmarks.

• Propose high-performance tridiagonal solves for the Sunway, Intel MIC, and NVIDIA GPU architectures. On the Sunway architecture, this work proposes a distributed Cyclic Reduction (CR) method to make the best use of vectorization and of the limited-size, manually controlled fast memory. For the Intel MIC and NVIDIA GPU architectures, this work proposes the Register-PCR(-half)-p Thomas method to make full use of the register resources, and the CR-Register-PCR(-half)-p Thomas method to balance computation and memory access. Compared with the traditional sequential method, the proposed methods achieve remarkable speedups on all five tested many-core architectures.
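The idea behind the first contribution, giving every core a similar amount of work and a private region of the output vector, can be illustrated with a minimal sketch. This is not the block layout proposed in the dissertation: it assumes the common CSR storage format and a simple contiguous row partition balanced by nonzero count, and the names `partition_rows_by_nnz`, `spmv_chunk`, and `spmv` are hypothetical. The chunks are processed in a plain loop here; on a real many-core processor each chunk would be assigned to one core.

```python
from bisect import bisect_left


def partition_rows_by_nnz(row_ptr, num_workers):
    """Split rows into contiguous chunks holding roughly equal nonzeros.

    row_ptr is the CSR row-pointer array (length n_rows + 1).
    Returns chunk boundaries: chunk w covers rows bounds[w]..bounds[w+1]-1.
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for w in range(1, num_workers):
        target = total_nnz * w // num_workers
        # First row whose starting offset reaches the target nonzero count.
        cut = bisect_left(row_ptr, target, bounds[-1], n_rows)
        bounds.append(cut)
    bounds.append(n_rows)
    return bounds


def spmv_chunk(row_ptr, col_idx, vals, x, y, start, end):
    """Compute y[start:end] for a CSR matrix.

    Each chunk writes only its own rows of y, so chunks never conflict.
    """
    for i in range(start, end):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc


def spmv(row_ptr, col_idx, vals, x, num_workers=4):
    """y = A @ x, with rows grouped into nnz-balanced chunks (run serially here)."""
    y = [0.0] * (len(row_ptr) - 1)
    bounds = partition_rows_by_nnz(row_ptr, num_workers)
    for start, end in zip(bounds, bounds[1:]):
        spmv_chunk(row_ptr, col_idx, vals, x, y, start, end)
    return y


if __name__ == "__main__":
    # 4x4 example matrix (an assumption for illustration):
    # [[4, 0, 0, 1],
    #  [0, 3, 0, 0],
    #  [2, 0, 5, 0],
    #  [0, 0, 0, 6]]
    row_ptr = [0, 2, 3, 5, 6]
    col_idx = [0, 3, 1, 0, 2, 3]
    vals = [4, 1, 3, 2, 5, 6]
    print(partition_rows_by_nnz(row_ptr, 2))
    print(spmv(row_ptr, col_idx, vals, [1, 2, 3, 4], num_workers=2))
```

Partitioning by rows keeps every write to `y` local to one chunk, which avoids write conflicts, while balancing on nonzeros rather than on row count addresses load imbalance for matrices with uneven row lengths. A single very long row can still unbalance this simple scheme, which is one reason finer-grained block layouts are studied.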

Keywords/Search Tags:

Sunway, sparse matrix computation, SpMV, triangular solve, tridiagonal solve

Related items

1	Research On Sparse Matrix Storage Format Suitable For Vectorization
2	The Research On The Computing Problems And The Properties About Special Matrices
3	Preconditioning for matrix computation
4	Computer Algebra To Solve The Differential Equation Method And Its Machine Realization
5	Nonlinear approximation techniques to solve network flow problems with nonlinear arc cost functions
6	Research On Heterogeneous Parallel Algorithms For Sparse Matrix Computation
7	A subspace method based on a differential equation approach to solve unconstrained optimization problems
8	New Algorithms And Properties For Some Special Matrices
9	The thought processes of ninth grade students from the University of Puerto Rico's secondary school when using TI-73 graphic calculators to solve single variable linear inequalities in elementary algebra
10	Research Of Fast Parallel Algorithm For Sparse Linear Systems On CPU+GPU Heterogeneous Platforms