| Matrix multiplication has become the core computational kernel of scientific computing and artificial intelligence. However, current support for matrix multiplication on the Shenwei-26010 processor still has many shortcomings. In this paper, we study parallel matrix multiplication algorithms that achieve both high peak performance and good generality on the Shenwei-26010 processor. In terms of computation, this paper exploits the dual-pipeline instruction execution of the CPE and ensures efficient instruction-level parallelism by manually controlling the instruction sequence of the core computation tasks at the assembly level. In terms of data access, this paper jointly considers the efficient use of limited on-chip storage and the overlapping of computation with data access, thereby improving the overall performance of the algorithms. The main research work and achievements of this paper are as follows: (1) To address the lack of generality in double-precision matrix multiplication implementations on the Shenwei-26010 processor, this paper designs a runtime-adaptive parallel algorithm for double-precision matrix multiplication. We comprehensively consider the possible loop orders over the matrix dimensions and the overhead of computation and data access, and then design a more adaptable blocking strategy based on overhead functions. To reduce the data access overhead, we design hybrid double buffering and broadcast-broadcast on-chip communication. The former overlaps computation and data access, while the latter maximizes the reuse rate of on-chip data. Because a general-purpose compiler lacks fine-grained modeling of the hardware, we manually translate the high-level language implementation of the computing kernel into an assembly implementation and then perform fine-grained instruction reordering to maximize the hardware's computational capability. Finally, the algorithm is equipped with an adaptive engine consisting of multiple overhead
formulas and multiple groups of block factors, allowing it to decide its execution behavior dynamically and cope with varying matrix multiplication scenarios. (2) To address the problem that single-precision matrix multiplication implementations on the Shenwei-26010 processor do not fully exploit its microarchitectural features for floating-point operations, this paper designs parallel algorithms for single-precision matrix multiplication at different storage levels. The algorithms implement three single-precision matrix multiplications based on register-level, LDM-level, and memory-level data type conversion. For register-level conversion, we redesign and reorder the instruction sequences of the computing kernel to hide the pipeline bubbles between separated instructions. Because LDM-level conversion occupies a large amount of additional on-chip storage, we propose two schemes, fixed LDM space and nested LDM space, to improve the utilization of on-chip storage resources. Finally, by fusing the data type conversion stage with the matrix multiplication stage, we use data reuse and double buffering to eliminate the additional data access overhead caused by memory-level conversion. Moreover, we design a buffer for partially converted data to reduce the extra main memory usage. Compared with swBLAS, the best existing official math library on the Shenwei-26010 processor, our double-precision matrix multiplication achieves almost the same peak performance, and our single-precision matrix multiplication improves by 6.8%. The generality of our implementations is significantly better: experiments show that in 95.67% and 99% of the matrix multiplication scenarios, respectively, our implementations improve performance by at least 5%, and the average performance improvements are 59.81% and 93.66%, respectively. |
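
The idea of an adaptive engine that selects among several groups of block factors using overhead formulas can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `BlockFactors`, `ldm_bytes`, `estimated_cost`, and `choose_block_factors` are hypothetical, the overhead formula is a simple textbook-style model (padded floating-point work versus main-memory traffic, with the slower of the two dominating because double buffering overlaps them), and the hardware numbers in the usage part are placeholders rather than measured values.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class BlockFactors:
    """One hypothetical group of block factors (tile sizes) for C = A * B."""
    bm: int  # rows of the A and C tiles
    bn: int  # columns of the B and C tiles
    bk: int  # shared dimension of the A and B tiles


def ldm_bytes(b, elem_bytes=8, double_buffered=True):
    """On-chip bytes for one C tile plus the A and B tiles.

    With double buffering, the A and B tiles are stored twice so that the
    next tiles can be loaded while the current ones are being used.
    """
    copies = 2 if double_buffered else 1
    return elem_bytes * (copies * (b.bm * b.bk + b.bk * b.bn) + b.bm * b.bn)


def estimated_cost(m, n, k, b, peak_flops, mem_bandwidth, elem_bytes=8):
    """Assumed overhead formula: estimated time for an m x n x k product."""
    tiles_m, tiles_n, tiles_k = ceil(m / b.bm), ceil(n / b.bn), ceil(k / b.bk)
    # Edge tiles are charged as full tiles, so poorly fitting block
    # factors are penalized through padded work.
    flops = 2.0 * (tiles_m * b.bm) * (tiles_n * b.bn) * (tiles_k * b.bk)
    # A is re-read once per column block of C, B once per row block of C,
    # and C is read and written once.
    traffic = elem_bytes * (m * k * tiles_n + k * n * tiles_m + 2 * m * n)
    compute_time = flops / peak_flops
    access_time = traffic / mem_bandwidth
    # Overlapping computation and data access leaves the slower one visible.
    return max(compute_time, access_time)


def choose_block_factors(m, n, k, candidates, ldm_capacity,
                         peak_flops, mem_bandwidth):
    """Pick the feasible candidate with the lowest estimated cost."""
    feasible = [b for b in candidates if ldm_bytes(b) <= ldm_capacity]
    if not feasible:
        raise ValueError("no candidate block factors fit in on-chip storage")
    return min(
        feasible,
        key=lambda b: estimated_cost(m, n, k, b, peak_flops, mem_bandwidth),
    )


if __name__ == "__main__":
    # Placeholder hardware parameters, chosen only to exercise the model.
    candidates = [
        BlockFactors(64, 64, 64),
        BlockFactors(256, 256, 128),
        BlockFactors(512, 512, 256),
    ]
    best = choose_block_factors(
        m=3000, n=500, k=8000,
        candidates=candidates,
        ldm_capacity=64 * 64 * 1024,  # assumed aggregate on-chip capacity
        peak_flops=7.0e11,            # assumed, not a measured peak
        mem_bandwidth=2.0e10,         # assumed, not a measured bandwidth
    )
    print(best)
```

Because the choice is made per call from the matrix shape, a tall-and-skinny problem and a square problem can end up with different block factors, which is the kind of runtime adaptivity the abstract describes. A real engine would also have to account for loop order and on-chip communication, which this sketch leaves out.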