Performance Optimization Research For Sparse Numerical Kernels On Sunway Architecture

Posted on: 2019-08-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Wang
Full Text: PDF
GTID: 1360330590451798
Subject: Computer Science and Technology
Abstract/Summary:
Computer numerical simulation is widely regarded as a driving force behind the rapid development of both science and industry. In today's high-end systems, many-core architectures have become the most promising component; for example, the current Top 5 supercomputers all adopt many-core processors. Because current and upcoming many-core architectures offer increasingly fine-grained parallelism and longer vector lengths, thread-level and instruction-level parallel algorithms are urgently needed but have not been well addressed. Meanwhile, solving large-scale linear systems with iterative methods is widely used in numerical simulations and is regarded as one of the most time-consuming components. During the solving process, sparse matrix-vector multiplication and the preconditioner are the most critical kernels, yet many issues hinder their performance on many-core architectures: poor data locality, write conflicts, load imbalance, computational dependencies, lack of vectorization opportunities, frequent cache checking, and fine-grained memory access. This research focuses on developing high-performance algorithms that exploit thread-level and instruction-level parallelism for sparse matrix-vector multiplication and for two important preconditioning kernels, the sparse triangular solve and the tridiagonal solve, on the first Chinese home-grown many-core processor, the Sunway 26010. The contributions of this research include:

· Propose a parallel sparse matrix-vector multiplication algorithm for the Sunway architecture that resolves the issues of poor data locality, write conflicts, load imbalance, frequent cache checking, and fine-grained memory access by dividing the sparse matrix into multiple regular blocks and distributing these blocks evenly across all the cores. Evaluated on all 2710 benchmarks from the Florida Matrix Collection, the proposed method achieves an average speedup of 11.7x and a best speedup of 55.0x compared with the sequential method on the management processing element.

· Propose a parallel sparse triangular solve for the Sunway architecture by introducing the Sparse Level Tile layout and the Producer-Consumer pairing method. The former resolves the issues of poor data locality, computational dependencies, frequent cache checking, and fine-grained memory access; the latter addresses how to carry out irregular computation and synchronization using regular register communication. Evaluated on all 2057 square benchmarks from the Florida Matrix Collection, the proposed method achieves an average speedup of 7.8x and a best speedup of 117.3x compared with the sequential method on the management processing element. Compared with the state-of-the-art methods for Intel KNC and NVIDIA GPU running on their respective platforms, the proposed method obtains the best performance on 1624 of the 2057 benchmarks.

· Propose high-performance tridiagonal solves for the Sunway, Intel MIC, and NVIDIA GPU architectures. On the Sunway architecture, this work proposes a distributed Cyclic Reduction (CR) method to best utilize vectorization and the limited-size, manually controlled fast memory. For the Intel MIC and NVIDIA GPU architectures, this work proposes the Register-PCR(-half)-pThomas method to make full use of the register resources and the CR-Register-PCR(-half)-pThomas method to balance computation and memory access. Compared with the traditional sequential method, the proposed methods achieve substantial speedups on all five tested many-core architectures.
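
The "calculate-dependency" issue named for the sparse triangular solve can be illustrated with a generic level-scheduling sketch. This is not the dissertation's Sparse Level Tile layout or Producer-Consumer pairing method, whose details the abstract does not give. It only shows, under the assumption of a lower-triangular matrix in CSR form with a nonzero diagonal, how rows can be grouped into levels so that every row in a level depends only on rows from earlier levels. The function names and the small example matrix are made up for illustration.

```python
def level_sets(row_ptr, col_idx, n):
    """Group rows of a lower-triangular CSR matrix into dependency levels.

    Row i depends on row j whenever L[i][j] != 0 with j < i, so its level
    is one more than the deepest level among the rows it depends on.
    """
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


def sptrsv_by_levels(row_ptr, col_idx, vals, b):
    """Solve L x = b by forward substitution, one level at a time."""
    n = len(b)
    x = [0.0] * n
    for rows in level_sets(row_ptr, col_idx, n):
        # Rows within one level are mutually independent, so a many-core
        # implementation could assign them to different cores. Here they
        # are simply processed in a sequential loop.
        for i in rows:
            acc = b[i]
            diag = None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[j]
            x[i] = acc / diag
    return x


# Hypothetical 4x4 example:
#     [2 0 0 0]
# L = [0 4 0 0]
#     [1 0 1 0]
#     [0 2 3 5]
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
vals = [2.0, 4.0, 1.0, 1.0, 2.0, 3.0, 5.0]
b = [2.0, 8.0, 4.0, 33.0]

levels = level_sets(row_ptr, col_idx, 4)
x = sptrsv_by_levels(row_ptr, col_idx, vals, b)
```

In this example rows 0 and 1 have no dependencies and share the first level, row 2 waits on row 0, and row 3 waits on rows 1 and 2. The number of levels bounds how much of the solve must stay sequential, which is why matrices with many short levels are hard to parallelize.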
Keywords/Search Tags: Sunway, sparse matrix computation, SpMV, triangular solve, tridiagonal solve