| Rounding errors make floating-point arithmetic non-associative: when the order of computation changes, the rounding errors differ, and so do the results. The multi-level parallelism and dynamic resource scheduling of modern computers make the order of computation less predictable, so non-reproducible results occur more often. We combine error-free transformation, pre-rounding, and multi-layer chunking techniques to implement an efficient reproducible algorithm library, based on the OpenBLAS design architecture, for domestic processor platforms. It comprises the reproducible basic linear algebra library FT-ReproBLAS and the trusted reduction library MPI_ACCU_REDUCE. FT-ReproBLAS has three main parts: a multi-layer chunked reproducible summation algorithm (on which the reproducible dot product and norm functions are built), a mixed-precision multi-layer chunked dot product algorithm, and a multi-threaded reproducible DGEMV algorithm. On top of the summation algorithm we design a multi-level parallel structure spanning SIMD, OpenMP, and MPI. Tested on three different ARM platforms, the multi-layer chunked reproducible summation algorithm achieves a speedup of 3.5x to 5x over the mainstream ReproBLAS library. The mixed-precision multi-layer chunked dot product algorithm extends the set of operations in FT-ReproBLAS. It has two variants, each of which uses different precisions inside and outside a chunk to balance efficiency and accuracy while exploiting low-precision compute throughput. The multi-threaded reproducible DGEMV function is at least 2x faster than the DGEMV algorithm in ReproBLAS, and in the single-threaded case it is more than 20x faster than the reproducible DGEMV function in OzBLAS. MPI_REDUCE is one of the most commonly used global reduction operations in MPI; it performs a global reduction across all members of a process group. The trusted reduction library MPI_ACCU_REDUCE contains five global reduction operations: high-precision summation, high-precision product, high-precision L2 norm, reproducible summation, and reproducible exact summation. Each operation is bound to its reduction operator through the MPI_Op_create function, which provides a reliable computational tool of practical importance to massively parallel computing. The main contributions of this paper are therefore: 1. To address the non-reproducibility of floating-point results, we design a more accurate and efficient multi-layer chunked reproducible algorithm and implement the efficient reproducible linear algebra library FT-ReproBLAS on the domestic processor platform. The library has three main parts: a multi-layer chunked reproducible summation algorithm, a mixed-precision multi-layer chunked dot product algorithm, and a multi-threaded reproducible DGEMV algorithm. It provides an effective tool for solving the problem of non-reproducible results in large-scale scientific computing. 2. To address the unreliability of the global reduction function MPI_REDUCE in MPI, we design and implement the reduction library MPI_ACCU_REDUCE, which contains five trusted reduction operations. Each operation is invoked through its corresponding reduction operator, which improves the accuracy and reliability of the reduction results. |
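To make the order dependence and the idea of an error-free transformation concrete, here is a minimal Python sketch. It is not taken from FT-ReproBLAS; the function names `two_sum`, `naive_sum`, and `compensated_sum` are hypothetical, and `two_sum` is the classic Knuth TwoSum, used here only as a representative error-free transformation.

```python
def two_sum(a, b):
    """Knuth's TwoSum (illustrative): returns (s, e) where s is the rounded
    sum fl(a + b) and e is the rounding error, so that a + b == s + e exactly
    (assuming no overflow)."""
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e


def naive_sum(xs):
    """Plain left-to-right summation; the result depends on the order of xs."""
    total = 0.0
    for x in xs:
        total += x
    return total


def compensated_sum(xs):
    """Summation that accumulates the rounding errors reported by two_sum
    and adds them back at the end. This improves accuracy, but on its own
    it does not guarantee bitwise reproducibility."""
    s = 0.0
    comp = 0.0
    for x in xs:
        s, e = two_sum(s, x)
        comp += e
    return s + comp


if __name__ == "__main__":
    # Regrouping the same three numbers changes the rounded result.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

    data = [1e16, 1.0, -1e16]
    print(naive_sum(data))        # 0.0: the 1.0 is lost to rounding
    print(compensated_sum(data))  # 1.0: the lost part is recovered
```

The second example shows why accuracy and reproducibility are separate goals: recovering the rounding error gives a better answer, but the result can still change with the summation order unless the algorithm is designed to remove that dependence.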
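The pre-rounding idea mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the multi-layer chunked algorithm of FT-ReproBLAS: the name `prerounded_sum`, the `levels` parameter, and the way the rounding boundary is chosen from the largest magnitude are all assumptions made for this example. Inputs are assumed to be finite doubles far from the overflow and underflow ranges.

```python
import math


def prerounded_sum(xs, levels=3):
    """Order-independent summation by pre-rounding (illustrative sketch).

    At each level every value is split into a high part q, rounded to a
    common grid, and an exact residual r - q. The grid is coarse enough
    that adding all high parts never rounds, so each level sum is the
    same for any ordering of the input.
    """
    residuals = [float(x) for x in xs]
    p = len(residuals).bit_length()          # len(xs) < 2**p
    level_sums = []
    for _ in range(levels):
        m = max((abs(r) for r in residuals), default=0.0)
        if m == 0.0:
            break
        _, e = math.frexp(m)                 # m < 2**e
        # Boundary 1.5 * 2**(e + p): adding it forces rounding to the grid
        # 2**(e + p - 52) while leaving headroom for len(xs) terms.
        boundary = math.ldexp(1.5, e + p)
        level_sum = 0.0
        next_residuals = []
        for r in residuals:
            q = (boundary + r) - boundary    # high part on the common grid
            level_sum += q                   # exact: no rounding occurs here
            next_residuals.append(r - q)     # exact low part
        level_sums.append(level_sum)
        residuals = next_residuals
    # Combine the level sums in a fixed order, largest level first.
    total = 0.0
    for s in level_sums:
        total += s
    return total


if __name__ == "__main__":
    import random

    data = [1e16, 1.0, -1e16, 3.14, 1e-8, -2.5e7, 2.5e7]
    shuffled = data[:]
    random.Random(0).shuffle(shuffled)
    print(prerounded_sum(data) == prerounded_sum(shuffled))
```

In a parallel setting the maximum magnitude would itself have to be obtained by a global reduction before the splitting step, which is one reason production libraries use more elaborate schemes than this sketch.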