Automatic Generation And Performance Optimization Of Code In Stencil Computation

Posted on:2017-05-02

Degree:Doctor

Type:Dissertation

Country:China

Candidate:T Q Mo

Full Text:PDF

GTID:1368330488471369

Subject:Computer Science and Technology

Abstract/Summary:

As a growing variety of applications run efficiently on GPUs, the graphics accelerator, originally specialized for graphical workloads, is increasingly used for general-purpose computing. Stencil computation, a class of scientific computation, can be ported to the GPU and accelerated significantly by exploiting its range of memory resources. In a stencil computation, the value at each grid point is updated repeatedly inside an iterated loop from the values of its neighbouring points. The outer loop usually corresponds to physical time; in every time step, the multidimensional arrays in the inner loops are defined as functions of neighbouring array elements. Stencils are now widely used in computational electromagnetics, the solution of partial differential equations, and CT and MRI imaging, as well as in accelerating physical simulations such as turbulent flow and seismic wave propagation.

Application specialists, however, often cannot exploit the features of the GPU architecture to improve stencil performance. To obtain better performance at the underlying system level together with good portability, it is therefore important to study and design a domain-specific language for stencil computation on GPUs. This thesis aims to design a domain-specific language that generates efficient stencil code and optimizes memory bandwidth on the GPU for large-scale computation. To that end, it studies in detail the stencil specification, the code generation algorithms, and the optimization policies based on architectural features.

First, architecture-specific code is generated from a stencil specification expressed in an abstract syntax. A new intermediate representation is built from the input, which is either a hand-coded rule or is extracted from stencil code written in a higher-level programming language. All fields and their corresponding stencil functions are represented in the abstract syntax tree. Every field defines its data type and the semantics of its read and write operations. Multiple stencil functions must be executed in sequential order.

Second, CUDA or OpenCL code with overlapped ghost zones is generated automatically. Performing local redundant computation within threads, rather than communicating between threads of different iteration tiles, makes fuller use of the GPU's computing power and reduces the communication overhead between warps or thread groups. The host code and the overlapped ghost zone code are generated by a series of algorithms, such as searching for the thread block size and allocating shared memory.

Third, the control flow divergence caused by the conditional statements that load the overlapped ghost zone of an iteration tile is eliminated. A GPU generally schedules threads in groups: 32 threads as one warp on NVIDIA GPUs, or 64 threads as one thread group on AMD GPUs. If the threads in such a group take different execution paths because of a conditional statement, and so fall out of step with one another, considerable overhead results. A new stencil data loading algorithm removes this control divergence.

Fourth, the available bandwidth is used more fully. When the values of an iteration tile are loaded, coalesced memory accesses reduce the required bandwidth. During thread execution, partial sums and intermediate data can be kept in registers because storage use is balanced between shared memory and registers. As a result, more threads can be launched and parallelism increases. Finally, a software prefetching mechanism lets arithmetic operations hide memory access latency, in addition to the latency already hidden by warp or thread group scheduling. Under this policy, the input data required by the next iteration is taken from data already loaded during the current iteration.
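
The overlapped ghost zone idea described above can be illustrated with a small sketch. This is not the thesis's generated CUDA or OpenCL code: it is a sequential Python model, assuming a 1D three-point averaging stencil with fixed end points, and the names jacobi_step, reference and overlapped_tiles are hypothetical. Each tile loads `steps` extra points per side and recomputes them redundantly, so it can advance several time steps without exchanging values with neighbouring tiles.

```python
def jacobi_step(values):
    """One sweep of a 3-point averaging stencil; both end points stay fixed."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out


def reference(grid, steps):
    """Apply `steps` sweeps to the whole grid at once."""
    for _ in range(steps):
        grid = jacobi_step(grid)
    return grid


def overlapped_tiles(grid, steps, tile):
    """Advance every tile by `steps` sweeps independently of the others.

    A tile covering [start, end) loads a ghost zone of `steps` points on
    each side (clipped at the grid edges). The outermost ghost points go
    stale by one more position per sweep, but after `steps` sweeps the
    staleness has not yet reached the tile's own points.
    """
    n = len(grid)
    result = list(grid)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(start - steps, 0)   # left edge including ghost zone
        hi = min(end + steps, n)     # right edge including ghost zone
        local = grid[lo:hi]
        for _ in range(steps):
            local = jacobi_step(local)  # redundant work inside ghost zone
        result[start:end] = local[start - lo:end - lo]
    return result


grid = [float((i * 7) % 11) for i in range(40)]
tiled = overlapped_tiles(grid, steps=3, tile=8)
whole = reference(grid, steps=3)
assert all(abs(a - b) < 1e-12 for a, b in zip(tiled, whole))
```

In this model the ghost zone width equals the number of fused time steps, so a wider ghost zone buys fewer synchronizations at the price of more redundant arithmetic. On a GPU, each tile would correspond to a thread block working in shared memory; that mapping is outside the scope of this sketch.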

Keywords/Search Tags:

Stencil Computation, Coalesced Memory Accesses, Iterating Tilings, Ghost Zone, Thread Block, Control Divergence, Stencil Specification, Domain-specific Language

Related items

1	Research On The Performance Optimizations For Stencil Computations On ARM High-performance Processor
2	Optimizations Of Memory-access For Stencil Computations On Shared-memory Multi-core Processor
3	Parallelization And Locality Optimization For Red-black Gauss-Seidel Stencil
4	Piezoflexure-enabled nanofabrication using translated stencil masks
5	A Research On The Automatic Image Measurement System For SMT Stencil
6	Research On Automatic Optical Inspection System Of SMT Stencil
7	Research On Periodic Decision-making Method Of Stencil Printing Machine Stencil Cleaning
8	The Study Based On Characteristics Of Stencil Opening Hole In The Solder Paste Deposition Process
9	Research On Stencil Defects System Based On Machine Vision
10	Research On Performance Optimizations Of Stencil Computations On Domestic Heterogeneous Many-core Processor