
Research On Efficient Matrix Multiplication Parallel Algorithms For Shenwei Heterogeneous Many-core Processor

Posted on: 2024-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z Wu
Full Text: PDF
GTID: 2568306932962299
Subject: Computer system architecture
Abstract/Summary:
Matrix multiplication has become the computing kernel of scientific computing and artificial intelligence. However, current support for matrix multiplication on the Shenwei-26010 processor still has many shortcomings. In this thesis, we study parallel matrix multiplication algorithms for the Shenwei-26010 processor that combine high peak performance with good generality. On the computation side, we exploit the dual-pipeline instruction execution of the CPEs and ensure efficient instruction-level parallelism by manually controlling, at the assembly level, the instruction sequence of the core computation tasks. On the data access side, we jointly consider the efficient use of the limited on-chip storage and the overlapping of computation with data access, which improves the overall performance of the algorithms. The main research work and results of this thesis are as follows:

(1) To address the lack of generality in double-precision matrix multiplication implementations on the Shenwei-26010 processor, this thesis designs a runtime adaptive parallel algorithm for double-precision matrix multiplication. We consider the possible loop orders over the matrix dimensions together with the overhead of computation and data access, and design a more adaptable blocking strategy based on overhead functions. To reduce the data access overhead, we design hybrid double buffering and broadcast-broadcast on-chip communication. The former overlaps computation and data access, while the latter maximizes the reuse rate of on-chip data. Because a general-purpose compiler lacks fine-grained hardware modeling, we manually translate the high-level language implementation of the computing kernel into assembly and then perform fine-grained instruction rearrangement to maximize the hardware's computation capability. Finally, the algorithm is equipped with an adaptive engine consisting of multiple overhead formulas and multiple groups of block factors, which lets it decide its execution behavior dynamically to cope with varying matrix multiplication scenarios.

(2) To address the problem that single-precision matrix multiplication implementations on the Shenwei-26010 processor do not fully exploit its microarchitectural features for floating-point operations, this thesis designs parallel algorithms for single-precision matrix multiplication at different storage levels. The algorithms implement three single-precision matrix multiplications based on register-level, LDM-level, and memory-level data type conversions. For the register-level conversion, we redesign and rearrange the instruction sequences of the computing kernel to hide the pipeline bubbles between separated instructions. Because the LDM-level conversion occupies a large amount of additional on-chip storage, we propose two schemes, fixed LDM space and nested LDM space, to improve the utilization of on-chip storage resources. Finally, by fusing the data type conversion stage with the matrix multiplication stage, we use data reuse and double buffering to eliminate the additional data access overhead caused by the memory-level conversion. We also design a buffer for partially converted data to reduce the extra usage of main memory.

Compared with swBLAS, the best existing official math library on the Shenwei-26010 processor, our double-precision matrix multiplication achieves almost the same peak performance, and our single-precision matrix multiplication improves peak performance by 6.8%. The generality of our implementations is significantly better. The experiments show that in 95.67% and 99% of the matrix multiplication scenarios, our implementations achieve a performance improvement of at least 5%, with average performance improvements of 59.81% and 93.66%, respectively.
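As a rough illustration of the blocking strategy based on overhead functions described in (1), the following Python sketch picks one group of block factors from a small candidate set by minimizing a toy cost estimate, then runs a blocked matrix multiplication with it. The candidate list, the cost formula, the cost weights, and all function names are assumptions made for this example; they are not the overhead formulas or block factors used in the thesis, and the sketch does not model LDM capacity, double buffering, or on-chip communication.

```python
from math import ceil

# Hypothetical candidate block factors (bm, bn, bk); not taken from the thesis.
CANDIDATES = [(16, 16, 64), (32, 32, 32), (64, 16, 32), (16, 64, 32)]


def estimated_cost(m, n, k, bm, bn, bk, flop_cost=1.0, word_cost=8.0):
    """Toy overhead function: padded compute plus main-memory traffic."""
    tm, tn, tk = ceil(m / bm), ceil(n / bn), ceil(k / bk)
    # Partial edge tiles are charged as if they were full tiles.
    compute = 2.0 * (tm * bm) * (tn * bn) * (tk * bk) * flop_cost
    # Each (i, j, p) step loads one A tile and one B tile;
    # each C tile is written back once per (i, j) pair.
    loads = tm * tn * tk * (bm * bk + bk * bn)
    stores = tm * tn * bm * bn
    return compute + (loads + stores) * word_cost


def choose_blocks(m, n, k, candidates=CANDIDATES):
    """Return the candidate block factors with the lowest estimated cost."""
    return min(candidates, key=lambda blk: estimated_cost(m, n, k, *blk))


def blocked_matmul(a, b, blocks=None):
    """Compute C = A * B tile by tile; a is m x k, b is k x n (lists of rows)."""
    m, k, n = len(a), len(b), len(b[0])
    bm, bn, bk = blocks or choose_blocks(m, n, k)
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, bm):
        for j0 in range(0, n, bn):
            for p0 in range(0, k, bk):
                # Multiply one A tile by one B tile and accumulate into the C tile.
                for i in range(i0, min(i0 + bm, m)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + bk, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + bn, n)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Calling `blocked_matmul(a, b)` without `blocks` lets the toy cost model choose the block factors at run time from the matrix shape, which mirrors in a very simplified way the idea of an adaptive engine selecting among several groups of block factors.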
Keywords/Search Tags: High Performance Computing, Parallel Algorithm, Heterogeneous Many-Core Processor, Shenwei-26010 Processor, Matrix Multiplication