Nowadays, GPGPU has become a prevailing computing platform in many domains, e.g., graphics processing and deep learning. Thanks to its SIMT architecture, GPGPU is well suited to massive repeated computations, which makes GPGPU applications highly threaded: one instruction is executed by many threads concurrently, as long as there is no data dependency among the corresponding computations. However, SIMT also makes GPGPU applications highly concurrent: the threads executing a single memory instruction generate multiple memory accesses, and this concurrency of memory accesses places a heavy burden on the memory hierarchy.

Meanwhile, to improve programmability and portability, GPGPU adopts virtual memory, which requires threads to translate virtual addresses into physical addresses before accessing physical memory; this is the virtual-to-physical address translation process. GPGPU uses TLBs, which cache recently used virtual-to-physical address mappings, to speed up this translation. However, due to hardware constraints such as area and power, a TLB on GPGPU cannot grow in capacity and complexity as easily as one on CPU. Consequently, efficiently handling the massive translations caused by the concurrency of GPGPU applications is a challenge for the TLB. Moreover, studies have found that virtual-to-physical address translation on GPGPU is more than 3 times as expensive as on CPU, which makes address translation one of the main overheads of GPGPU applications.
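To make the translation step concrete, below is a minimal sketch of a TLB lookup. The flat map and all names are illustrative assumptions for exposition, not GPGPU-Sim's actual data structures.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Minimal sketch of a TLB that caches virtual-to-physical page mappings.
constexpr uint64_t kPageBits = 12;  // assume 4 KiB pages

struct Tlb {
    std::unordered_map<uint64_t, uint64_t> entries;  // VPN -> PPN

    // Translate a virtual address; a miss means a page-table walk is needed.
    std::optional<uint64_t> translate(uint64_t vaddr) const {
        uint64_t vpn = vaddr >> kPageBits;
        auto it = entries.find(vpn);
        if (it == entries.end()) return std::nullopt;       // TLB miss
        uint64_t offset = vaddr & ((1ull << kPageBits) - 1);
        return (it->second << kPageBits) | offset;          // physical address
    }
};
```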
To address the concurrency of GPGPU applications and the high overheads of virtual-to-physical address translation, this paper proposes a request-coalescing-enabled L2 TLB that exploits the continuity of address translations. The main work and contributions of this paper are:

1) This paper analyzes the virtual-to-physical address translation requests generated under the GPGPU architecture on the GPGPU-Sim simulator and finds that the execution of one memory instruction can produce massive concurrent translation requests from the threads executing that instruction. A GPGPU TLB, whose capacity and functionality are constrained by area and power, has difficulty processing these concurrent requests efficiently, which increases translation overheads. Since the L2 TLB is shared among multiple cores under the GPGPU architecture, it suffers far more from this concurrency. The analysis shows that the L2 TLB's ability to handle concurrency is the key to reducing virtual-to-physical address translation overheads.

2) This paper designs the structure of a request-coalescing-enabled L2 TLB that relieves the impact of concurrency by reducing the number of requests reaching the L2 TLB: it exploits the continuity among translation requests and coalesces contiguous ones (see the first two sketches after this list). This paper also observes that GPGPU applications allocate most of their memory before kernel execution, and uses this observation to map contiguous virtual address space to contiguous physical address space, which creates the continuity that makes translation requests coalescible.

3) This paper designs a new coalesced miss-handling process for the request-coalescing-enabled L2 TLB that leverages DRAM's ability to read contiguous data: the entries for all requests in one coalesced request are fetched from DRAM successively and stored in a new L2 TLB structure (see the third sketch after this list). In this way, one DRAM access prepares data for multiple requests, which greatly reduces the number of DRAM accesses and, with it, the overhead of the virtual-to-physical address translation process.

4) This paper studies the GPGPU-Sim source code in depth and implements virtual memory and the request-coalescing-enabled L2 TLB on top of it. Performance is then evaluated on GPGPU-Sim with selected benchmarks, followed by a detailed analysis of the results. Experimental results show that the request-coalescing-enabled L2 TLB reduces average stall cycles by 29.21%.
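The first sketch illustrates the allocation policy behind contribution 2: because GPGPU applications allocate most memory before kernel launch, contiguous virtual pages can be backed by contiguous physical pages, which is what makes neighboring translation requests coalescible. The bump-style allocator and all names are hypothetical stand-ins for the actual implementation.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical allocator that maps contiguous virtual pages to
// contiguous physical pages, creating continuity for coalescing.
struct BumpAllocator {
    uint64_t next_ppn = 0;
    std::unordered_map<uint64_t, uint64_t> page_table;  // VPN -> PPN

    // Back `num_pages` consecutive virtual pages with consecutive
    // physical pages, handed out in order.
    void map_region(uint64_t base_vpn, uint64_t num_pages) {
        for (uint64_t i = 0; i < num_pages; ++i)
            page_table[base_vpn + i] = next_ppn++;
    }
};
```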
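The second sketch shows how contiguous translation requests arriving at the L2 TLB could be merged: a run of consecutive virtual page numbers becomes one coalesced request covering the whole run. This is an illustrative reconstruction of the idea, not the paper's exact hardware logic.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One coalesced request stands for the VPN range
// [base_vpn, base_vpn + count).
struct CoalescedReq {
    uint64_t base_vpn;
    uint32_t count;
};

// Merge a batch of pending VPN requests into runs of consecutive pages.
std::vector<CoalescedReq> coalesce(std::vector<uint64_t> vpns) {
    std::sort(vpns.begin(), vpns.end());
    vpns.erase(std::unique(vpns.begin(), vpns.end()), vpns.end());
    std::vector<CoalescedReq> out;
    for (uint64_t vpn : vpns) {
        if (!out.empty() && out.back().base_vpn + out.back().count == vpn)
            ++out.back().count;       // extends the current contiguous run
        else
            out.push_back({vpn, 1});  // starts a new run
    }
    return out;
}
```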
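The third sketch outlines the coalesced miss-handling process of contribution 3: since the page-table entries of consecutive virtual pages are adjacent in memory, the entries for an entire coalesced request can be fetched with one DRAM access and installed together. The loop below stands in for that single modeled burst; the flat page table and the TLB map are hypothetical simplifications of the simulator's structures.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Service one coalesced L2 TLB miss. In the modeled hardware the
// `count` adjacent page-table entries come back from a single DRAM
// access; here that burst is represented by one loop over the flat
// page table (indexed by VPN), filling every entry in the run.
void serve_coalesced_miss(std::unordered_map<uint64_t, uint64_t>& l2_tlb,
                          const std::vector<uint64_t>& page_table,
                          uint64_t base_vpn, uint32_t count) {
    for (uint32_t i = 0; i < count; ++i)
        l2_tlb[base_vpn + i] = page_table[base_vpn + i];
}
```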