
Highly Efficient Matrix Operations on Vector-SIMD DSPs

Posted on: 2014-01-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K Zhang
Full Text: PDF
GTID: 1228330479979569
Subject: Electronic Science and Technology
Abstract/Summary:
As demand for high performance grows and power budgets in applications tighten, Single Instruction Multiple Data (SIMD) technology has been widely adopted in Digital Signal Processors (DSPs). Matrix operations have always been among the classic problems of high-performance computing. However, matrix operations on current vector-SIMD DSPs suffer from several problems, such as inefficient use of computation resources and bandwidth, frequent memory-access conflicts, and high communication overhead, all of which sharply reduce processor performance. It is therefore important to study highly efficient matrix operations on vector-SIMD processors. This dissertation applies techniques including model analysis, software optimization, hardware assistance, and hardware/software co-optimization. The main contributions are summarized as follows:

(1) To effectively support General-Purpose Matrix Multiplication (GEMM) problems, which arise from dense linear algebra, this dissertation presents a framework for high-performance DSPs based on SIMD technology, and maps the GotoBLAS library onto the proposed architecture. By investigating the factors that influence the performance and efficiency of GEMM, including the execution scheme of the algorithm, data transfer across the memory hierarchy, the pipeline depth of the function units, software pipelining, and loop unrolling, we construct a performance model for GEMM on SIMD DSPs.

(2) Based on the proposed GEMM performance model, we study the factors that influence GEMM efficiency, including performance targets, the memory hierarchy, the core size, and the number of cores, and then make effective design trade-offs for the proposed high-performance DSP architecture.
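The GotoBLAS-style GEMM the performance model targets partitions the matrices into blocks sized to the memory hierarchy. A minimal pure-Python sketch of that blocked loop structure follows; the block sizes `MC`, `KC`, `NC` are illustrative placeholders, not values from the dissertation, and a real DSP kernel would replace the innermost loops with SIMD instructions.

```python
def gemm_blocked(A, B, C, M, N, K, MC=4, KC=4, NC=4):
    # C[M][N] += A[M][K] @ B[K][N], computed block by block so that each
    # MC x KC panel of A and KC x NC panel of B fits in a fast on-chip
    # memory level (the idea behind GotoBLAS-style blocking).
    for jc in range(0, N, NC):          # loop over column panels of B/C
        for pc in range(0, K, KC):      # loop over the shared K dimension
            for ic in range(0, M, MC):  # loop over row panels of A/C
                for i in range(ic, min(ic + MC, M)):
                    for j in range(jc, min(jc + NC, N)):
                        s = 0
                        for p in range(pc, min(pc + KC, K)):
                            s += A[i][p] * B[p][j]
                        C[i][j] += s
    return C
```

The two outer block loops fix which panels are resident in fast memory; everything inside them reuses those panels, which is what turns GEMM from a bandwidth-bound into a compute-bound kernel.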
The analysis and trade-off strategy based on the proposed performance model can effectively guide the design of DSPs that are efficient for general-purpose HPC and especially efficient for matrix operations.

(3) We propose a fine-grained pipelined mechanism for LU decomposition on SIMD processors, consisting of a fine-grained pipelined algorithm and a fast data-sharing technique between the scalar unit and the vector unit. The algorithm transforms two sequential tasks so that they execute in parallel on the scalar unit and the vector unit of the SIMD processor, fully utilizing all of its computation resources and exploiting pipeline parallelism. Through software optimization, the proposed algorithm eliminates non-coalesced memory accesses and improves the performance of LU decomposition on SIMD processors. A Shared Register File (SRF) provides the mechanism for fast data sharing between the scalar unit and the vector unit, further improving LU-decomposition performance by accelerating communication between the scalar task and the vector task and by reducing the delays and conflicts that this communication causes.

(4) We propose a software/hardware technique to accelerate Sparse Matrix-Vector Multiplication (SpMV). This dissertation studies the performance bottlenecks of SpMV on current SIMD architectures and proposes a new SpMV algorithm based on the Stride-combination CSR with Transpose (SCT) format together with a Vector Write Buffer (VWB). The SCT-based algorithm effectively increases the utilization of the SIMD units and of the bandwidth for accessing non-zero elements. The blocked SCT-based algorithm eliminates the conflicts caused by indirect SIMD accesses to the vector x and increases the bandwidth utilization of those accesses.
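The SCT format is a stride-combined, transposed variant of Compressed Sparse Row (CSR). As a baseline for what SCT improves on, here is a plain-CSR SpMV sketch; the indirect reads of `x` and the per-row writes of `y` in this loop are exactly the accesses that the SCT format and the VWB, respectively, are designed to make SIMD-friendly.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    # y = A @ x for a sparse A stored in standard CSR:
    #   values[k]  = k-th non-zero element
    #   col_idx[k] = its column index
    #   row_ptr[i] = start of row i's non-zeros in values/col_idx
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]  # indirect, gather-style access
        y[i] = s                            # write-back the VWB would buffer
    return y
```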
The VWB combines several divergent write accesses into one contiguous access, increasing the bandwidth utilization of write-back operations by reducing the number of memory accesses. Together, these software/hardware SpMV optimizations overcome the SpMV performance bottlenecks of current SIMD processors.

(5) To accelerate matrix operations of different sizes in the high-performance embedded domain, we propose a Multi-Grained Matrix Register File (MMRF) that can be dynamically configured into different operating modes. The MMRF supports both row-wise and column-wise accesses to one or several sub-matrices in parallel, which eliminates the data-rearrangement operations otherwise needed when matrix operations of different sizes are mapped onto SIMD processors. By exploiting the data-level and thread-level parallelism in the matrix operations of embedded applications, the MMRF effectively improves the performance of SIMD processors. Furthermore, the MMRF can be applied to existing SIMD processors without modifying their instruction set architectures.
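Looking back at contribution (3), the two tasks that the fine-grained scheme pipelines across the scalar and vector units can be seen in a basic Doolittle LU factorization. The scalar/vector assignment in the comments below is an illustrative assumption; the dissertation's exact task split and software-pipelined overlap are not reproduced here.

```python
def lu_inplace(A, n):
    # In-place Doolittle LU without pivoting: on return, A holds U in its
    # upper triangle and the multipliers of a unit-lower-triangular L below.
    for k in range(n):
        # Task 1 (scalar-unit candidate): divisions producing the
        # column-k multipliers, a short serial dependency chain.
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
        # Task 2 (vector-unit candidate): rank-1 update of the trailing
        # submatrix, which vectorizes cleanly along each row.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A
```

In the sequential version shown, Task 2 of step k must finish before Task 1 of step k+1; the fine-grained pipelined algorithm overlaps these two task streams on the two units, with the SRF carrying the multipliers between them.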
Keywords/Search Tags: Vector SIMD, High-performance computing, Highly efficient matrix operations, General-Purpose Matrix Multiplication, LU decomposition, Sparse Matrix-Vector Multiplication, Matrix Register File