Research and Optimization of DGEMM Based on ARMv8 Multi-core Processor Architecture

Posted on: 2022-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: W Zhang
GTID: 2558307169478354
Subject: Engineering

Abstract/Summary:
As high-performance computing (HPC) plays an increasingly critical role in fields such as medicine, energy, environmental science, and bioengineering, it has come to be regarded as one of the indicators of a nation's scientific and technological development. Scientific and engineering computations on HPC systems rely on high-performance basic linear algebra libraries. The Basic Linear Algebra Subprograms (BLAS) are widely used in such computations as the building blocks of many compute-intensive linear algebra operations. BLAS standardizes a set of application programming interfaces that different implementations can provide. It has been shown that most Level-3 BLAS routines (matrix-matrix operations) can be built on general matrix multiplication (GEMM). Because double-precision general matrix multiplication (DGEMM) is a core part of the LINPACK benchmark used to measure the potential performance of HPC systems, optimizing DGEMM is important for high-performance BLAS libraries.

In recent years, with the wide application of HPC, the hardware that makes up HPC systems has been trending towards both higher performance and higher energy efficiency. The ARMv8 architecture has been adopted for building HPC systems because of its advantages in both respects. For example, the Phytium 2000+ and the Fujitsu A64FX were used to build the Tianhe-3 supercomputer in China and the Fugaku supercomputer in Japan, respectively. Architectural features of ARMv8-based multi-core processors include a wider addressing range, the Neon extension, the Scalable Vector Extension (SVE), double-precision floating-point support in the Neon vector units, and FMA SIMD instructions, all of which facilitate the development of more efficient DGEMM kernels. However, the growing number of cores per processor, the increasingly complex memory hierarchy, and the stronger non-uniform memory access (NUMA) effect limit the performance of DGEMM on ARMv8 multi-core processors. These new features and issues bring new challenges to the development of DGEMM in BLAS libraries.

In this thesis, we study the performance of DGEMM on the ARMv8 multi-core processor architecture, using Huawei's in-house Kunpeng 920 processor as the platform, and optimize both the single-core and the multi-core performance of DGEMM. The main work and contributions are as follows.

(1) To address the challenges that the architectural features of the Kunpeng 920 processor pose for DGEMM optimization, we designed and optimized GEBP, the kernel of DGEMM, based on OpenBLAS for the Kunpeng 920 processor. First, we improved the classical compute-to-memory-access ratio model for the Kunpeng 920 architecture and used the model to determine the register and cache block sizes. Second, building on the blocked algorithm used in GotoBLAS, we optimized the design of the register kernel using efficient Neon vector instructions, cache prefetch instructions, and data transfer instructions of the 64-bit ARMv8 instruction set architecture, together with loop unrolling and instruction scheduling. Finally, we evaluated its performance on a Kunpeng 920 processor. For large-scale matrix multiplication, the optimized DGEMM outperforms the original DGEMM, with an average performance improvement of 15.26% and a peak performance improvement of 13.7%.

(2) To address the performance bottleneck and scalability issues of DGEMM caused by the NUMA effect, we propose a NUMA-aware parallel optimization method whose goal is to reduce the number of cross-die and cross-chip memory accesses. NUMA-aware DGEMM achieves two levels of parallelism, between NUMA nodes and within each NUMA node. Most critically, it lets each NUMA node obtain independent tasks and keep its data local, with the overhead of data localization shown to be worthwhile. We implemented the approach on top of the DGEMM function interface in OpenBLAS and evaluated the performance and scalability of DGEMM on a TaiShan 2280 server. The results show that, compared with OpenBLAS DGEMM, NUMA-aware DGEMM improves average performance by 10.29% and peak performance by 13.69%, reduces cross-die and cross-chip read operations by about 24.8% and 22.6%, respectively, and reduces cross-die and cross-chip write operations by 62.6% and 29.6%, respectively.
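
The blocked structure referred to in contribution (1) can be illustrated with a minimal sketch. This is not the thesis's kernel: it is plain Python rather than Neon assembly, it omits the packing of A and B into contiguous buffers that GotoBLAS-style implementations perform, and the block sizes `mc`, `kc`, `nc`, `mr`, `nr` are hypothetical defaults chosen only so the example runs on small inputs. The function `reg_tile_ratio` shows the simplified textbook compute-to-memory-access ratio for a register tile, which is an assumption here and not the improved model the thesis derives for the Kunpeng 920.

```python
def reg_tile_ratio(mr, nr):
    """Simplified ratio for one rank-1 update of an mr x nr register tile:
    2*mr*nr flops (multiply + add) per mr + nr operand elements loaded."""
    return 2.0 * mr * nr / (mr + nr)


def micro_kernel(A, B, C, i0, j0, p0, mr, nr, kc):
    """Register-kernel analogue: accumulate an mr x nr tile of C over kc steps."""
    acc = [[0.0] * nr for _ in range(mr)]   # stands in for vector registers
    for p in range(p0, p0 + kc):
        for i in range(mr):
            a = A[i0 + i][p]
            row = acc[i]
            for j in range(nr):
                row[j] += a * B[p][j0 + j]
    for i in range(mr):                      # write the tile back once
        for j in range(nr):
            C[i0 + i][j0 + j] += acc[i][j]


def blocked_dgemm(A, B, C, mc=8, kc=8, nc=8, mr=4, nr=4):
    """C += A @ B using GotoBLAS-style loop ordering (hypothetical block sizes)."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):               # column panel of B
        nb = min(nc, n - jc)
        for pc in range(0, k, kc):           # depth block
            kb = min(kc, k - pc)
            for ic in range(0, m, mc):       # row block of A
                mb = min(mc, m - ic)
                # GEBP: (mb x kb block of A) times (kb x nb panel of B)
                for jr in range(jc, jc + nb, nr):
                    for ir in range(ic, ic + mb, mr):
                        micro_kernel(A, B, C, ir, jr, pc,
                                     min(mr, ic + mb - ir),
                                     min(nr, jc + nb - jr), kb)
    return C
```

Under this simplified model, `reg_tile_ratio(4, 4)` evaluates to 4.0 and an 8 x 4 tile to about 5.33, which is why larger register tiles are attractive until the register file runs out. The actual tile and cache block sizes for a given processor depend on its register count and cache capacities.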

Keywords/Search Tags: DGEMM, ARMv8, Kunpeng 920, High-Performance Computing, Non-uniform Memory Access, Kernel Optimization, Parallel Optimization