
Research and Implementation of Global Shared Memory for Sparse Matrix-Vector Multiplication

Posted on: 2023-01-02    Degree: Master    Type: Thesis
Country: China    Candidate: J F Cui    Full Text: PDF
GTID: 2568307169983529    Subject: Engineering
Abstract/Summary:
Sparse matrix-vector multiplication (SpMV) has long been an important computing kernel, widely used in engineering computing, scientific computing, and other fields. In artificial intelligence, the rapidly growing number of neural network parameters produces large-scale sparse matrices, raising the optimization requirements for SpMV in terms of both time efficiency and energy efficiency. Many researchers therefore focus on improving the computational performance of SpMV.

SpMV is a typical memory-access-intensive computing task that puts great pressure on memory bandwidth; on most computing platforms, memory access bandwidth is the bottleneck that limits its efficiency. On-chip memory is one effective way to relieve this pressure. A non-cache (scratchpad) on-chip memory combined with matrix preprocessing achieves excellent performance when the sparse matrices are regular. But the sparse matrices in practical engineering applications are different: their non-zero elements are distributed more sparsely and irregularly, and a suitable partitioning strategy is hard to find, so the efficiency of the non-cache structure drops significantly. The M processor is a high-performance multi-core processor independently developed by the National University of Defense Technology, and its design requirements include optimizing the computing efficiency of artificial intelligence tasks such as convolutional neural networks. It therefore needs a general-purpose on-chip memory design that works across different matrix patterns. To address this problem, this paper makes the following contributions:

(1) We analyzed sparse matrices covering a variety of practical application fields and explored the distribution of their non-zero elements, then evaluated in detail whether these matrices exhibit enough data locality for a cache structure to exploit. The results show that the evaluated matrices commonly contain both regularly placed and randomly distributed non-zero elements, a mix that suits a cache structure.

(2) We explored the cache design space through software simulation, seeking a balance between hardware overhead and performance. We analyzed cache capacity, cache line size, mapping strategy, and replacement policy, all of which strongly affect cache performance, and finally identified a suitable set of parameter values.

(3) We designed and implemented a global shared memory (GSM) with a reduced control pipeline, together with a miss buffer based on linked storage. This miss buffer can flexibly provide storage space for miss requests and minimizes the pipeline stalls caused by insufficient miss buffer space. In this GSM design, all requests processed by the GSM are partitioned into three types, with priority determined by a request arbitration module; all requests are handled by a unified pipeline, which effectively reduces the stalls caused by bypassing.

(4) We performed module-level verification and logic synthesis of the GSM design. We first derived a function-point list from the design document, built a verification platform in SystemVerilog, and designed generation and constraint rules for the stimulus signals according to the verification requirements. We also used code-coverage analysis to uncover design bugs and gaps in the stimuli. In the end, all function points were verified correctly, code coverage reached 100%, and the area, power, and timing results satisfied the design requirements.

(5) Based on the verification platform, we evaluated the cache performance of this GSM design. Compared with the theoretical performance of the SpMV computation mode that uses a non-cached structure with matrix preprocessing, the proposed design is more efficient in most cases and adapts better to the SpMV computation tasks found in practical applications.
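For context, the kernel the abstract optimizes can be sketched as a plain CSR-format SpMV. This is a generic reference implementation under common CSR conventions, not the thesis's code, and the example matrix is invented for illustration; the indirect reads `x[col_idx[k]]` are what make SpMV memory-bound and make the on-chip memory (cache or scratchpad) design so consequential:

```python
def spmv_csr(n_rows, row_ptr, col_idx, val, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR format.

    row_ptr has n_rows + 1 entries; the non-zeros of row i occupy
    positions [row_ptr[i], row_ptr[i+1]) of val and col_idx.
    The gather x[col_idx[k]] is irregular for irregular matrices,
    which is why partitioning for a scratchpad is hard and a cache
    structure can adapt better.
    """
    y = [0.0] * n_rows
    for i in range(n_rows):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += val[k] * x[col_idx[k]]
        y[i] = s
    return y


# 3x3 example:  [[2, 0, 1],
#                [0, 3, 0],    with x = [1, 2, 3]  =>  y = [5, 6, 19]
#                [4, 0, 5]]
y = spmv_csr(3, [0, 2, 3, 5], [0, 2, 1, 0, 2],
             [2.0, 1.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
```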
Keywords/Search Tags:SpMV, GSM, Cache, Pipeline, MSHR, DSP, HPC