| The finite element method is a numerical simulation method used to verify the feasibility of engineering designs. In some critical engineering fields, such as nuclear power engineering, large-scale, high-precision simulations are required. However, because of memory limitations, serial finite element methods cannot solve large-scale simulation problems, so parallel finite element methods are needed. The Hybrid Total Finite Element Tearing and Interconnecting (HTFETI) method is currently the most advanced parallel finite element method and can run on thousands of computing nodes to solve large-scale problems. To use HTFETI efficiently on domestic heterogeneous supercomputing platforms, research on computation acceleration and load balancing is necessary. This article analyzes the iterative solution process and task partitioning of HTFETI. The iterative solution process involves a large number of sparse matrix-vector multiplications. The task partitioning achieves load balancing by applying secondary domain decomposition and multilevel partitioning algorithms to the model’s mesh elements, with each process handling multiple subdomains as a cluster. The existing computational acceleration solution converts sparse matrix-vector multiplications into dense matrix-vector multiplications and offloads them directly to heterogeneous devices, without designing calculation schemes tailored to varying matrix sizes. The existing load balancing solution only balances the number of mesh elements between clusters. This article finds that, because of spatial differences in the model structure, the assembled matrices differ in size after task partitioning, which leads to different computational workloads between clusters, so further balancing is necessary. The main research work of this article is to achieve HTFETI computational acceleration and load balancing 
based on a domestically produced heterogeneous supercomputing platform. (1) For the two HTFETI matrix-vector multiplication operators with different scale characteristics, one on large-scale square matrices and one on small-scale fat matrices, multi-stream asynchronous pipeline scheduling and a variable-scale batched kernel function are proposed. The multi-stream asynchronous pipeline scheduling executes the kernel function in multiple streams in parallel, uses page-locked memory to improve access efficiency, and uses pipeline scheduling to overlap data copies with computation. The variable-scale batched kernel function is highly parallel in implementation and can operate on matrices of different sizes. (2) To address the load imbalance of HTFETI, a coarse-grained load balancing algorithm based on graph repartitioning and a fine-grained load balancing algorithm based on work stealing are proposed. The coarse-grained algorithm adjusts mesh elements and subdomains between clusters by defining subdomain weights related to computational load and estimated runtime. The fine-grained algorithm measures actual runtime, identifies high-load and low-load processes, and transfers computation tasks between them to balance the load. (3) A multi-strategy optimized HTFETI operating framework is constructed on a domestic heterogeneous supercomputing platform, and the experimental results are analyzed. The multi-stream asynchronous pipeline scheduling improves data throughput by 10% to 36%, the variable-scale batched kernel function improves GFLOPS by 37% on average, and the multi-granularity load balancing strategy reduces the load imbalance rate from 150% to between 105% and 109%. Combined, these strategies increase the HTFETI iterative solution speed by 1.64 to 2.65 times. In addition, large-scale scalability experiments and China Experimental Fast Reactor full-core assembly 
simulation experiments were conducted, achieving 78% weak-scaling parallel efficiency and 72% strong-scaling parallel efficiency on 3072 computing nodes, which demonstrates the framework's ability to solve large-scale problems. |
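
The fine-grained load balancing idea described above can be illustrated with a minimal Python sketch. It is not the article's implementation. It assumes that the load imbalance rate is the maximum per-process runtime divided by the mean runtime, expressed as a percentage, and that each process holds a list of independent tasks with known positive runtimes. The function names `imbalance_rate` and `steal_work` and the toy task costs are hypothetical.

```python
from typing import List


def imbalance_rate(runtimes: List[float]) -> float:
    """Assumed definition: max runtime / mean runtime, as a percentage."""
    mean = sum(runtimes) / len(runtimes)
    return 100.0 * max(runtimes) / mean


def steal_work(tasks: List[List[float]], max_rounds: int = 100) -> List[List[float]]:
    """Greedy sketch of runtime-driven task transfer.

    tasks[p] holds the measured runtimes of the tasks owned by process p.
    Each round moves one task from the most loaded process to the least
    loaded one, and stops when no single move would help.
    """
    tasks = [list(t) for t in tasks]  # work on a copy
    for _ in range(max_rounds):
        loads = [sum(t) for t in tasks]
        hi = max(range(len(loads)), key=loads.__getitem__)  # high-load process
        lo = min(range(len(loads)), key=loads.__getitem__)  # low-load process
        gap = loads[hi] - loads[lo]
        # Only a task cheaper than the gap leaves both processes below the
        # previous maximum after the move.
        candidates = [c for c in tasks[hi] if c < gap]
        if not candidates:
            break
        # Prefer the task that brings the two loads closest together.
        best = min(candidates, key=lambda c: abs(c - gap / 2))
        tasks[hi].remove(best)
        tasks[lo].append(best)
    return tasks


if __name__ == "__main__":
    # Toy data, not taken from the article's experiments.
    before = [[4, 4, 4, 3], [2, 2], [3, 3]]
    after = steal_work(before)
    print(imbalance_rate([sum(t) for t in before]))
    print(imbalance_rate([sum(t) for t in after]))
```

On this toy input the per-process loads go from 15, 4, 6 to 8, 8, 9, so the assumed imbalance rate drops from about 180% to about 108%. A real distributed version would also have to account for the cost of moving task data between processes, which this sketch ignores.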