With the development of big data, the scale of data sets for GPU applications has increased dramatically in recent years, which poses challenges to GPUs with limited memory capacity. With the support of unified virtual memory and demand paging, GPUs can execute in an application-transparent manner under memory oversubscription. However, such transparent management still comes at a severe performance cost, especially for applications with inter-kernel data sharing. While there have been many efforts to reduce the additional data migrations caused by memory oversubscription, few consider the reuse of shared data at the boundary between adjacent kernels. Due to limited memory capacity, a kernel often demands shared data that has already been evicted during the previous kernel, resulting in a significant number of costly data migrations. Therefore, this paper focuses on the reuse of shared data between neighboring kernels. The main contributions are as follows:

· Research on characteristics of GPGPU applications: This paper conducts an in-depth study of a large number of workloads in the GPGPU benchmarks. Based on this study, it systematically summarizes the programming modes and data access patterns of GPU applications. In addition, applications are classified and counted according to their programming modes. Finally, data sharing between the kernels of several test programs is analyzed quantitatively.

· CTA-Page cooperative data reuse mechanism: Based on the analysis of application characteristics and memory access patterns, this paper proposes a CTA-Page cooperative data reuse mechanism, called CPC, targeting applications whose kernels exhibit similar memory access characteristics. It transparently reduces the impact of memory oversubscription by coordinating CTA (Cooperative Thread Array) dispatch switching with page replacement switching to reuse inter-kernel shared data. Experimental results show that CPC reduces the page fault rate by an average of 46.6% compared to the Baseline, leading to an average of 90% and 65% performance improvement over the Baseline and the state-of-the-art memory management framework ETC, respectively.

· Hardware-software collaborative data reuse mechanism: This paper proposes a universal mechanism, called JCS, to reuse the shared data between kernels. Based on the global access information of each CTA obtained by the JIT (Just-In-Time) compiler and the state of the GPU memory, the CTA scheduling strategy is re-planned so that CTAs with higher priority can reuse shared data more efficiently. Experimental results show that, for single-kernel-type applications (multi-kernel-type applications), JCS reduces the page fault rate by an average of 26.6% (6.6%) compared to the Baseline, leading to an average of 58% (12.6%) and 38.8% (5%) performance improvement over the Baseline and ETC, respectively.
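To make the underlying problem concrete, the following toy simulation contrasts plain LRU eviction with a reuse-aware policy that prefers to evict pages not shared with the next kernel. This is only an illustrative sketch of the general idea of preserving inter-kernel shared data; the page counts, memory capacity, and the `protect_shared` heuristic are all assumptions for illustration, not the actual CPC or JCS algorithms.

```python
from collections import OrderedDict

def run_kernels(accesses_per_kernel, capacity, shared_pages=None,
                protect_shared=False):
    """Simulate demand paging across kernels; return total page faults.

    With protect_shared=True, the replacement policy prefers evicting
    pages NOT shared across kernels (a toy stand-in for a reuse-aware
    policy); otherwise it behaves as plain LRU."""
    memory = OrderedDict()          # page -> None, least recent first
    shared = shared_pages or set()
    faults = 0
    for accesses in accesses_per_kernel:
        for page in accesses:
            if page in memory:
                memory.move_to_end(page)   # hit: refresh recency
                continue
            faults += 1                    # miss: fault + data migration
            if len(memory) >= capacity:
                victim = None
                if protect_shared:
                    # least recent non-shared page, if any exists
                    victim = next((p for p in memory if p not in shared),
                                  None)
                if victim is None:
                    victim = next(iter(memory))  # plain LRU victim
                del memory[victim]
            memory[page] = None
    return faults

# Kernel 1 streams over pages 0..9; kernel 2 reuses pages 0..5 (shared
# data) and then reads four new pages, under a 6-page memory capacity.
k1 = list(range(10))
k2 = list(range(6)) + list(range(10, 14))
shared = set(range(6))

lru_faults = run_kernels([k1, k2], capacity=6)
reuse_faults = run_kernels([k1, k2], capacity=6, shared_pages=shared,
                           protect_shared=True)
print(lru_faults, reuse_faults)   # LRU faults more: it evicts the
                                  # shared pages before kernel 2 runs
```

Under plain LRU, kernel 1's streaming access evicts exactly the pages kernel 2 needs first, so every shared page faults again; the reuse-aware policy keeps them resident and kernel 2's shared accesses hit in memory.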