| Because of heat dissipation, power consumption, and other factors, the performance of traditional single-core processors has failed to keep pace with the growth of hardware resources. Yet the recent blossoming of innovative applications in high-performance computing demands ever higher computer performance. Compared with traditional single-core processors, multi-core/many-core processors can take advantage of thread-level parallelism to improve performance and meet these demands, an approach that has been widely accepted in both academia and industry. Despite their higher FLOPS and better computational capabilities, multi-core/many-core processors also have complex architectures and programming environments, which makes exploiting their computing power a prominent problem. To address this problem, it becomes particularly important to identify the core algorithms of numerous applications and to optimize them for multi-core/many-core processors according to their characteristics. This paper uses dense matrix multiplication, matrix inversion, and the FFT as representative regular kernels for this research. The paper first introduces basic matrix operations under the CUDA architecture. For matrix multiplication, a strip partitioning is first implemented directly from the definition of matrix multiplication; next, to reduce the number of global memory accesses, shared memory is used together with a checkerboard partitioning to improve performance; finally, because each SM has many registers, the computation is restructured to increase register use and further improve performance. For matrix inversion, the original version is based on Gaussian elimination: it allocates two separate blocks of storage, one for the original matrix and one for the identity matrix, and performs normalization and elimination operations on both. In this version, many of the operands in some threads are zero, so those threads perform useless operations and waste computing resources. In the optimized method, the original matrix and the identity matrix are merged, which improves the utilization of computing resources. For the FFT, based on an analysis of the algorithm's parallelism, a multi-threaded parallel mapping is used and the algorithm is optimized with respect to the memory hierarchy. In addition, the input data of the DIT-FFT is generally not in natural order and must be rearranged by bit reversal; this rearrangement has to be performed on the CPU side, and frequent data transfers between host and device reduce program performance. An algorithm whose input and output are both in natural order is therefore used instead, which improves program performance. Experimental results show that the CUDA implementations of the three modules achieve a tenfold speedup over the CPU implementations and have some advantages over the CUBLAS and CUFFT libraries that ship with CUDA. |
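
The merged-matrix idea behind the optimized inversion can be illustrated with a minimal CPU-side sketch in Python. This is not the paper's CUDA kernel: the function name `invert_gauss_jordan`, the partial-pivoting step, and the singularity tolerance `1e-12` are all assumptions added for illustration. The sketch only shows how the original matrix and the identity matrix can share one working array, so that each normalization and elimination step is a single row operation.

```python
def invert_gauss_jordan(a):
    """Invert a square matrix by Gauss-Jordan elimination on the merged matrix [A | I]."""
    n = len(a)
    # Merge A and the identity matrix into one n x 2n working matrix.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(a)
    ]
    for col in range(n):
        # Partial pivoting: an assumption added for numerical safety,
        # not something stated in the abstract.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # hypothetical tolerance
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Normalization: scale the pivot row so that the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Elimination: clear this column in every other row.
        for r in range(n):
            if r != col:
                f = aug[r][col]
                if f != 0.0:
                    aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    # The right half of the merged matrix now holds the inverse.
    return [row[n:] for row in aug]


if __name__ == "__main__":
    for row in invert_gauss_jordan([[4, 7], [2, 6]]):
        print(row)
```

In a GPU version, each row update in the inner loop would presumably be spread across threads; the sketch keeps it sequential to make the normalization and elimination steps easy to follow.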