
Research On High Performance Dense Linear Algebra Libraries

Posted on: 2021-03-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Su    Full Text: PDF
GTID: 1368330611993123    Subject: Software engineering
Abstract/Summary:
Dense Linear Algebra (DLA) software is fundamental to scientific and engineering computation: almost all numerical applications rely on computations over matrices. The Basic Linear Algebra Subprograms (BLAS) library lays the groundwork for numerical DLA computation, making it the most important library in the software stack of the DLA domain. BLAS defines a collection of routines that serve as basic building blocks for DLA programs. Because the BLAS application programming interface (API) is so widely used in both academia and industry, it has become the de facto standard for DLA computation. Among all BLAS routines, GEneral Matrix Multiply (GEMM) is the most important, because most other matrix-matrix routines in BLAS can be built upon an optimized GEMM routine. As a result, optimizing GEMM has always been the focus of BLAS development.

Computing hardware evolves rapidly. For general-purpose CPUs, architecture design has shown several trends in recent years: the steady growth in the number of on-chip processor cores, the wide adoption of Single-Instruction Multiple-Data (SIMD) extensions, increasingly complex memory hierarchies, and the Non-Uniform Memory Access (NUMA) property of large computing systems. These new architectural features pose challenges to the development of high-performance BLAS libraries. This thesis addresses critical issues in developing BLAS libraries for multi-core and many-core processors. The main contributions are summarized as follows.

(1) We propose Poca, a portable compiler-based approach for automatically generating and optimizing key kernel routines in the DLA domain. In current BLAS implementations, kernel routines are usually written and hand-tuned in assembly by domain experts to fully exploit the capabilities of the underlying hardware. Coding in assembly is tedious and error-prone, and it costs a great deal of expert labor. Worse, the methodology is non-portable, because new assembly kernels must be developed for every new processor. The key insight of Poca is to leverage the wide range of architecture-specific abstractions (e.g., supported SIMD engines and instruction latencies) already available in LLVM: it first generates a vectorized μkernel in the architecture-independent LLVM IR and then boosts its performance through a series of domain-specific yet architecture-independent optimizations. The optimized μkernel, obtained without any involvement of domain experts, drops easily into existing BLAS frameworks such as OpenBLAS and BLIS.

(2) We present a Shared Cache Partitioning (SCP) method to reduce inter-thread cache conflicts on architectures with shared non-LRU caches. Current GEMM implementations assume that both the L1 and L2 caches are private to processor cores and use an LRU replacement policy. However, some emerging processors come with non-LRU shared caches. On these processors, data from different threads may conflict with each other in the shared cache, increasing the cache miss rate. The basic idea of SCP is to partition the shared cache into physically disjoint sets and assign different sets to different threads. This can be achieved by exploiting the memory-mapping mechanism of the set-associative cache.

(3) We propose a hybrid-grained dynamic load-balancing method to reduce the penalty caused by the NUMA effect on NUMA architectures. Because the GEMM problem has a regular shape, a coarse-grained strategy has traditionally been used to parallelize it, in which the whole workload is partitioned equally among all threads. This coarse-grained strategy works well on a homogeneous processor, where all threads run at roughly the same speed. That is not the case on NUMA architectures: processor cores see different memory latencies when accessing different memory nodes, so threads on different cores may run at different speeds. GEMM performance suffers from this variation in thread speed because the overall execution time is determined by the slowest thread. Our proposed dynamic load-balancing method is essentially a work-stealing algorithm specifically designed and optimized for the GEMM problem. It is extremely efficient because it completely avoids the task queues, dependency graphs, and thread locks that are common in general work-stealing algorithms.
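To make the notion of a GEMM μkernel concrete, the sketch below shows a plain-C register-blocked 4x4 micro-kernel of the kind that frameworks like OpenBLAS and BLIS plug in, and that Poca generates automatically in LLVM IR. All names, tile sizes, and the packed-panel layout here are illustrative assumptions, not code from Poca or the thesis; a real μkernel would keep the accumulators in SIMD registers.

```c
#include <stddef.h>

#define MR 4   /* rows of the register tile (assumed value) */
#define NR 4   /* columns of the register tile (assumed value) */

/* Hypothetical micro-kernel: C += A * B for one MRxNR tile.
   A is an MRxK packed panel (column at a time), B is a KxNR packed
   panel (row at a time), C is stored with leading dimension ldc. */
static void ukernel_4x4(size_t K, const double *A, const double *B,
                        double *C, size_t ldc)
{
    double c[MR][NR] = {{0.0}};   /* accumulators, ideally registers */
    for (size_t k = 0; k < K; ++k) {
        /* rank-1 update: outer product of the k-th column of A
           with the k-th row of B -- the heart of the kernel loop */
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                c[i][j] += A[k * MR + i] * B[k * NR + j];
    }
    /* write the tile back into C */
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            C[(size_t)i * ldc + j] += c[i][j];
}
```

The outer BLAS framework tiles the full matrices into such packed panels and calls the μkernel in the innermost loop, so this one routine dominates GEMM performance.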
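The SCP idea in contribution (2) rests on how a set-associative cache maps addresses to sets. The sketch below, under assumed cache parameters (line size and set count are placeholders, not the thesis's constants), shows the set-index computation and a check that a thread's packing buffer stays inside its assigned group of sets; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64    /* bytes per cache line (assumption) */
#define NUM_SETS  1024  /* sets in the shared cache (assumption) */

/* Set index an address maps to in a set-associative cache. */
static unsigned cache_set_of(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Does a buffer of `len` bytes at `addr` touch only the sets in
   [set_lo, set_hi) assigned to one thread?  SCP chooses buffer
   placement so that this holds by construction, so two threads'
   packed panels can never evict each other from the shared cache. */
static int fits_partition(uintptr_t addr, size_t len,
                          unsigned set_lo, unsigned set_hi)
{
    for (uintptr_t a = addr; a < addr + len; a += LINE_SIZE) {
        unsigned s = cache_set_of(a);
        if (s < set_lo || s >= set_hi)
            return 0;
    }
    return 1;
}
```

Because consecutive lines walk through consecutive sets, a contiguous buffer of more than `set span * LINE_SIZE` bytes necessarily leaves its partition; real SCP placement therefore interleaves each thread's buffer across memory at set-period strides.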
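Contribution (3) claims a work-stealing scheme with no task queues, dependency graphs, or locks. One queue-free way to realize that claim, sketched below as an assumption rather than the thesis's exact algorithm, is to cut the GEMM workload into many small blocks and let each thread claim the next unprocessed block with a single atomic fetch-and-add, so a fast thread naturally ends up processing more blocks than a slow one.

```c
#include <stdatomic.h>

/* Shared cursor over the block list; no queue or lock is needed. */
static atomic_int next_block;

/* Claim one block; returns -1 when no work remains. */
static int claim_block(int num_blocks)
{
    int b = atomic_fetch_add(&next_block, 1);
    return (b < num_blocks) ? b : -1;
}

/* Worker loop run by each thread: keep claiming and processing
   blocks until the cursor runs past the end.  Returns how many
   blocks this worker processed (more for faster threads). */
static int worker(int num_blocks, void (*process)(int))
{
    int done = 0;
    for (int b; (b = claim_block(num_blocks)) >= 0; ++done)
        process(b);
    return done;
}
```

A hybrid-grained variant could hand each thread a coarse initial share and fall back to this fine-grained atomic cursor only for the tail of the work, keeping contention on `next_block` low.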
Keywords/Search Tags:High Performance Computing, Linear Algebra, Matrix Multiply, Compiler, Parallelization