As High Performance Computing (HPC) moves towards exascale, scientific observations and large-scale simulation experiments produce massive amounts of data. Analyzing and processing large-scale scientific data, and extracting valuable information from it, is a significant and challenging task. Driven by the application requirements of large-scale data analysis and processing, it is of great significance to investigate highly scalable parallel algorithms for big data processing on distributed computing platforms, which can better mine the scientific meaning behind massive scientific data and expedite data-driven scientific discovery.

The core of big data processing often comes down to solving specific mathematical problems; for example, spectral clustering reduces to a matrix eigenvalue problem, and matrix multiplication is the most basic algebraic operation involved. Recently, besides computational complexity, communication overhead and scalability have become critical issues on supercomputing systems, whose scale has been increasing rapidly. Current big data processing methods, e.g., eigenvalue solvers and big data query, still suffer from high computational complexity and poor scalability. This motivates us to seek more efficient algorithms in HPC. This thesis focuses on the core methods of big data processing: parallel matrix multiplication, matrix eigenvalue algorithms, and fast index generation and query technology. We analyze these problems theoretically and experimentally, and then propose more efficient and scalable algorithms accordingly. The innovations and contributions are as follows.

For parallel matrix multiplication, expanding the system scale increases the communication cost. In this thesis, we propose a 2.5D parallel universal matrix multiplication algorithm (PUMMA) and a parallel structured matrix multiplication algorithm (PSMMA) to handle dense and structured matrices, respectively. The 2.5D
PUMMA redistributes data from a 2D processor grid to a 2.5D grid, using the extra memory on each computing node so that a few processes store the entire matrix. The operand matrices are replicated across different process groups to reduce the communication cost. PSMMA uses the generators of a structured matrix to construct local matrices, reducing the communication cost, while a low-rank approximation algorithm further reduces the computation cost. Compared with classical matrix multiplication algorithms, the proposed algorithms reduce the number of communications, improve data locality, and achieve higher scalability. The speedup of 2.5D PUMMA over the PDGEMM routine in ScaLAPACK reaches 2.20-2.93. For large-scale structured matrices, PSMMA is up to 12.96 times faster than PDGEMM.

To address the high computational complexity and poor scalability of the traditional divide-and-conquer (DC) algorithm for eigenvalue problems, a parallel structured divide-and-conquer algorithm (PSDC) is designed for the tridiagonal matrix eigenvalue problem. In addition, we propose a parallel banded structured divide-and-conquer algorithm (PBSDC) to solve banded matrix eigenvalue problems. For certain matrices, the worst-case complexity of the tridiagonal DC algorithm is O(n³), where n is the dimension of the matrix. PSDC uses a block low-rank (BLR) matrix to approximate the structured matrix and calls PSMMA to compute the eigenvectors and eigenvalues, reducing the worst-case complexity of the tridiagonal DC algorithm to O(n²r), where r is approximately a constant much smaller than n in practice. The classical eigenvalue solving process requires tridiagonalization, whose performance is limited by the bandwidth. PBSDC sidesteps this issue by further exploiting rank-structured matrix techniques to improve performance. Experimental results show that PSDC achieves speedups of 1.4-1.6 over the PDSTEDC function in ScaLAPACK for large-scale, low-deflation
matrices. Compared with the tridiagonalization-based banded algorithm in ELPA, the PBSDC algorithm achieves a maximum speedup of 163 for banded matrices with small bandwidth and little deflation.

For the large-scale scientific data location problem, we propose a parallel index generation and retrieval method for parallel file systems, named Bi-cluster+. This method designs a hierarchical structure to generate indexes in parallel: the block index serves as the coarse-grained range index, and when a certain proportion of hot data blocks is identified, fine-grained indexes (e.g., bitmap index and inverted index) are built in parallel to obtain the exact location of the required data. This effectively reduces the index size and accelerates index generation. Furthermore, a logical data block merging read strategy is designed to maximize I/O throughput by reading multiple consecutive or non-consecutive hit blocks at a time; it avoids redundant data retrieval and the excessive I/O access overhead caused by differently sized data blocks. At the same time, a computing resource optimization strategy is proposed to deal with the load imbalance among uneven computing tasks. The retrieval performance of Bi-cluster+ is 6.1 times faster than FastQuery, 8.5 times faster than the block index, and 69 times faster than a full scan. Meanwhile, the Bi-cluster+ query method has good scalability and can scale to at least 17,496 processes.
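The communication saving behind 2.5D replication can be illustrated numerically. In the standard 2.5D analysis (Solomonik and Demmel), replicating the operands in c copies across a P-process machine lowers the per-process bandwidth cost of an n×n multiply from O(n²/√P) to O(n²/√(cP)), a factor of roughly √c. The sketch below only evaluates these asymptotic formulas; the parameter values are illustrative and not taken from the thesis experiments.

```python
import math

def comm_words_2d(n: int, p: int) -> float:
    """Per-process bandwidth cost of a 2D (SUMMA-style) multiply: O(n^2 / sqrt(p))."""
    return n * n / math.sqrt(p)

def comm_words_25d(n: int, p: int, c: int) -> float:
    """Per-process bandwidth cost with c replicated copies: O(n^2 / sqrt(c * p))."""
    return n * n / math.sqrt(c * p)

# Illustrative parameters (not from the thesis): 16384x16384 matrices, 1024 processes.
n, p, c = 16384, 1024, 4
saving = comm_words_2d(n, p) / comm_words_25d(n, p, c)
print(f"c={c} replicated copies cut communicated words by ~{saving:.2f}x")
# The reduction factor is sqrt(c) = 2.00 here.
```

The extra factor of c in memory is exactly the trade the abstract describes: spare node memory is spent on replication to buy less communication.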
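The complexity reduction that PSMMA and PSDC exploit can be sketched in a serial setting: if a block B admits an accurate rank-r factorization B ≈ U·Vᵀ with r ≪ n, then A·B can be formed as (A·U)·Vᵀ in O(n²r) instead of O(n³) operations, matching the O(n²r) bound cited above. The SVD-based truncation here is a generic stand-in for the thesis's generator- and BLR-based compression, not the actual PSMMA construction.

```python
import numpy as np

def truncated_factors(B: np.ndarray, r: int):
    """Rank-r factorization B ~ U @ Vt via truncated SVD (a generic stand-in
    for generator/BLR-based compression of a structured block)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r, :]

rng = np.random.default_rng(0)
n, r = 256, 8
# Build an exactly rank-r "structured" block so the compression is lossless here.
B = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
A = rng.standard_normal((n, n))

U, Vt = truncated_factors(B, r)
fast = (A @ U) @ Vt   # two skinny products: O(n^2 r) work
exact = A @ B         # dense product: O(n^3) work
print("max error:", np.abs(fast - exact).max())
```

For genuinely structured blocks the rank r stays small as n grows, which is why the approach scales where the dense O(n³) path does not.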
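The hierarchical idea behind Bi-cluster+, coarse block-level range indexes that prune most blocks, with fine-grained bitmaps built only for hot blocks, can be sketched serially as follows. All names, the block size, and the lazy "promote on first hit" policy are illustrative simplifications; the thesis builds both index levels in parallel on a parallel file system.

```python
# Coarse level: per-block [min, max] ranges.  Fine level: per-value bitmaps,
# built lazily only for blocks that queries actually touch ("hot" blocks).
BLOCK = 4
data = [5, 9, 2, 7,  41, 44, 40, 43,  8, 1, 3, 6,  42, 45, 44, 40]
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
ranges = [(min(b), max(b)) for b in blocks]          # coarse range index

def build_bitmaps(block):
    """Fine-grained index: value -> bitmap of positions inside the block."""
    bm = {}
    for pos, v in enumerate(block):
        bm[v] = bm.get(v, 0) | (1 << pos)
    return bm

hot_bitmaps = {}                                     # block id -> bitmaps, on demand

def query(value):
    """Locate `value`: prune blocks by range, then resolve via bitmaps."""
    hits = []
    for bid, (lo, hi) in enumerate(ranges):
        if not (lo <= value <= hi):
            continue                                 # coarse pruning skips the block
        if bid not in hot_bitmaps:                   # promote block to fine index
            hot_bitmaps[bid] = build_bitmaps(blocks[bid])
        bm = hot_bitmaps[bid].get(value, 0)
        hits += [bid * BLOCK + p for p in range(BLOCK) if bm >> p & 1]
    return hits

print(query(44))  # only the two blocks whose ranges cover 44 are touched
```

The merging read strategy from the abstract would sit below this layer, coalescing the hit blocks returned by `query` into as few I/O requests as possible.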