
Research On The Key Technologies Of Parallel Processing Architecture Optimization Based On Scene Features

Posted on: 2024-04-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S W Jia
Full Text: PDF
GTID: 1528307340469854
Subject: Integrated circuit system design

Abstract/Summary:
With the continuous development of information technology, and especially the growth of applications driven by big data, ever higher requirements are placed on the computing capability and memory access performance of processor platforms. The computing power of traditional architectures such as the Central Processing Unit (CPU) and the Digital Signal Processor (DSP) can no longer meet application requirements. The General-Purpose Graphics Processing Unit (GPGPU), a massively parallel processing architecture designed to deliver very high computing power, adopts the single-instruction multiple-data (SIMD) computing model: by simplifying control logic, more chip area is devoted to computing resources. In addition, the GPGPU ensures efficient memory access for many concurrent threads through a multi-level memory design. This architectural goal of high computing power and high throughput makes the GPGPU naturally suited to the parallel acceleration of applications, and it is now widely used in many fields.

Although the peak computing power of GPGPUs continues to increase with the development of microelectronics technology, the underlying hardware is so complex that upper-layer software cannot fully optimize for the irregular execution characteristics of different applications. As a result, a large number of applications cannot be executed efficiently on a GPGPU. Our work analyzes the execution characteristics of numerous scene algorithms from currently popular fields, such as computer vision and biomedicine, and then focuses on the three main factors that lead to inefficient GPGPU computing: first, branch divergence, which reduces the utilization of computing resources; second, low memory access performance caused by cache contention; and third, frequent idling of computing resources caused by barrier synchronization. Targeting these three factors, our work studies the key technologies of GPGPU parallel processing architecture optimization in depth. The main contributions are as follows:

1. To address the inefficient utilization of computing resources in Breadth-First Search, MUMmer, and other applications with dense branching behavior, and the fact that existing branch-divergence optimization mechanisms limit multithreaded parallelism, a branch-compaction microarchitecture supporting the parallel execution of multiple paths is studied. When branch divergence occurs, our design performs branch compaction and records the information of the different instruction paths in the same lookup-table entry. In addition, the design dynamically updates two bitmasks to mark the warps that can be executed in parallel across all instruction paths. The warp issue unit can therefore read the information of threads on different instruction paths simultaneously, increasing thread-level parallelism within a GPGPU SM, improving the utilization of computing resources, and ultimately improving the performance of scene algorithms with massive branch divergence. Compared with the baseline and the two latest branch-divergence optimization mechanisms, our design achieves average performance improvements of 4.7%, 3.4%, and 2.3%, respectively.

2. To address the low efficiency of GPGPU L1 Dcache accesses in 2D and 3D convolution, which leaves data locality underexploited, a dynamic GPGPU cache-bypassing microarchitecture based on the memory access features of 2D and 3D convolution is studied. First, our design defines a set of information that reflects the characteristics of each memory access, and samples this information for all memory accesses at runtime. Second, the GPGPU warp scheduling unit is optimized to speed up the sampling process. Finally, an L1 Dcache bypassing rule is defined from the sampled access characteristics; based on this rule, the L1 Dcache dynamically decides, for each memory access, whether bypassing should be performed. Because each memory access in 2D and 3D convolution selectively bypasses the L1 Dcache, our design preserves L1 Dcache space for high-locality data as far as possible, improving performance and reducing memory stall cycles. Compared with the baseline architecture, our work achieves performance improvements of 2.16% and 19.79% in 2D and 3D convolution, and reduces L1 Dcache stall cycles by 7.63% and 21.40%, respectively.

3. To address the dense barrier synchronization in applications such as SRAD and SP, which leads to frequent idling of GPGPU computing resources, and the fact that current optimization mechanisms ignore the effect of memory access on barrier synchronization overhead, a GPGPU microarchitecture design for dense barrier synchronization is studied. First, our design optimizes the warp scheduling unit of the baseline architecture, defining the barrier synchronization cost of each thread block as the number of its warps already waiting at the barrier point; the warp scheduling order is then determined by this cost. The higher a thread block's barrier synchronization cost, the higher the scheduling priority of each of its warps, so the release of barrier synchronization is accelerated and the synchronization cost is reduced. Second, our design optimizes the L1 Dcache, combining the L1 Dcache miss rate with the barrier synchronization state to decide the L1 Dcache bypassing operation for each memory access. Bypassing speeds up memory accesses from thread blocks with dense barrier synchronization, reducing the barrier synchronization cost while preserving L1 Dcache efficiency. Compared with the baseline and the two latest barrier-synchronization optimization mechanisms, our design achieves average performance improvements of 4.15%, 4.13%, and 2.62%, and reduces barrier synchronization cycles by 17.09%, 3.97%, and 5.82%, respectively.
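The branch-divergence problem targeted by point 1 can be sketched with a toy SIMT simulation. This is an illustrative model only, not the dissertation's microarchitecture: real GPGPU hardware uses a reconvergence stack, and all names here are invented.

```python
# Toy SIMT model: a warp of 8 lanes executes both sides of a branch
# serially, masking off inactive lanes.

WARP_SIZE = 8

def execute_warp(values):
    """Count the issue slots a warp spends on a divergent branch.

    Each lane takes the 'then' path if its value is even, else 'else'.
    Under SIMD lockstep the warp issues each taken path once for the
    whole warp, so slots = number of distinct paths, and lane
    utilization = active lane-cycles / (slots * WARP_SIZE).
    """
    then_mask = [v % 2 == 0 for v in values]
    else_mask = [not m for m in then_mask]
    paths = [m for m in (then_mask, else_mask) if any(m)]
    slots = len(paths)                       # serialized path executions
    active = sum(sum(m) for m in paths)      # lane-cycles doing real work
    utilization = active / (slots * WARP_SIZE)
    return slots, utilization

# Uniform warp: one path, full utilization.
print(execute_warp([2, 4, 6, 8, 10, 12, 14, 16]))   # (1, 1.0)
# Divergent warp: both paths execute serially, half utilization.
print(execute_warp([1, 2, 3, 4, 5, 6, 7, 8]))       # (2, 0.5)
```

Branch compaction, as described in point 1, attacks exactly this utilization loss by letting warps on different paths be marked and issued in parallel rather than serialized.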
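The benefit of cache bypassing in point 2 can be shown with a minimal direct-mapped cache model. This is a hypothetical sketch: the dissertation derives its bypassing rule from sampled access features of 2D/3D convolution, which is not reproduced here; the `bypass` flag below simply stands in for that rule's decision.

```python
# Toy L1 Dcache with a per-access bypass decision. Bypassed accesses
# skip the cache entirely, preserving lines for high-reuse data.

class BypassingCache:
    def __init__(self, num_lines):
        self.lines = [None] * num_lines      # direct-mapped tag store
        self.hits = self.misses = self.bypasses = 0

    def access(self, addr, bypass=False):
        if bypass:                           # low-locality access:
            self.bypasses += 1               # do not allocate a line
            return
        idx = addr % len(self.lines)
        if self.lines[idx] == addr:
            self.hits += 1
        else:
            self.misses += 1
            self.lines[idx] = addr           # allocate on miss

# A reused working set (e.g. a convolution stencil) plus a one-shot
# stream that maps onto the same cache sets:
cache = BypassingCache(4)
stencil = [0, 1, 2]                          # high-locality addresses
stream = [4, 5, 6]                           # one-shot, conflicts with 0..2
for a in stencil:
    cache.access(a)                          # warm-up: 3 cold misses
for a in stream:
    cache.access(a, bypass=True)             # would otherwise evict stencil
for a in stencil:
    cache.access(a)                          # all hits: stencil survived
print(cache.hits, cache.misses, cache.bypasses)   # 3 3 3
```

Without the bypass, the stream would evict the stencil lines and the second pass over the stencil would miss again, which is the cache-contention effect the dissertation's bypassing rule avoids.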
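The barrier-cost-driven scheduling of point 3 can be sketched as a priority function over thread blocks. All class and function names here are invented for illustration; the cost metric (warps already waiting at the barrier) follows the description above, but the real design also folds in the L1 Dcache state.

```python
# Toy warp scheduler that prioritizes thread blocks by barrier cost:
# the more warps a block has parked at its barrier, the sooner its
# remaining runnable warps should be issued, so the barrier releases
# earlier and fewer cycles are spent idle.

from dataclasses import dataclass

@dataclass
class ThreadBlock:
    bid: int
    warps_total: int
    warps_at_barrier: int = 0        # warps waiting at the barrier point

    @property
    def barrier_cost(self):
        return self.warps_at_barrier

def pick_next_block(blocks):
    """Return the block whose runnable warps get issue priority."""
    runnable = [b for b in blocks if b.warps_at_barrier < b.warps_total]
    return max(runnable, key=lambda b: b.barrier_cost)

blocks = [
    ThreadBlock(bid=0, warps_total=8, warps_at_barrier=2),
    ThreadBlock(bid=1, warps_total=8, warps_at_barrier=7),  # almost released
    ThreadBlock(bid=2, warps_total=8, warps_at_barrier=0),
]
print(pick_next_block(blocks).bid)   # 1
```

Block 1 wins because scheduling its single remaining warp releases seven waiting warps at once, whereas scheduling block 2 releases none.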
Keywords/Search Tags: GPGPU parallel processing architecture, application features, branch divergence, warp scheduling, L1 Dcache, barrier synchronization