Resource underutilization exploitation for power efficient and reliable throughput processo

Posted on:2016-07-27

Degree:Ph.D

Type:Dissertation

University:University of Southern California

Candidate:Jeon, Hyeran

GTID:1478390017480687

Subject:Computer Engineering

Abstract/Summary:

The continuing march of Moore's law, in spite of many prior dire predictions, enables chip designs with tens of billions of transistors today. But as Dennard scaling slows irrefutably, power consumption has become the first-order design constraint. Furthermore, with device scaling, reliability has also come to the forefront of design considerations. To avoid excessive power consumption, the chip industry has shifted away from high-performance single-threaded designs to high-throughput multi-threaded designs. Nowhere is this design trend more starkly visible than in Graphics Processing Unit (GPU) design. GPUs are provisioned with hundreds of execution units and megabytes of register file to run thousands of threads concurrently. Their high throughput and excellent performance per watt have attracted efforts to port general-purpose applications to run on GPUs. Hence, a new computing paradigm called general-purpose computing on GPUs (GPGPU computing) has emerged. When GPUs execute general-purpose code with irregular parallelism, the massive on-chip resources available for concurrent thread execution become underutilized. This dissertation presents two mechanisms that exploit this resource underutilization to improve power efficiency and reliability.

The first mechanism is register file virtualization. This approach is motivated by the observation that at any given instant during an application's execution, only a fraction of the total allocated registers carry live data. Registers holding dead data can be eagerly deallocated and then reassigned to new threads. Our scheme takes advantage of register liveness information to allow a flexible mapping between architected registers and their corresponding physical registers. Register virtualization tackles the inefficiency of the existing GPU register management method, which is the root cause of both the power and the imbalanced wear-leveling problems. By exploring different mapping algorithms, register virtualization can improve either power efficiency or GPU reliability. Our results show that register virtualization effectively reduces both the register demand and the imbalanced wear-leveling problem.

Inspired by the reduced demand on the register file when using register virtualization, we also propose a more aggressive mechanism, GPU-Shrink, that under-provisions the register file by as much as 50% of the current GPU register file size. GPU-Shrink guarantees deadlock-free application execution with a slightly modified warp scheduler. The new warp scheduler reserves a minimum number of available registers to guarantee the progress of at least one thread block within an application. Our results show that GPU-Shrink effectively reduces the register file's dynamic and static power with negligible performance overhead.

The second mechanism exploits execution unit underutilization to improve GPU reliability. Due to branch and memory divergence, several execution lanes in a GPU are left idle. We propose Warped-DMR, which reuses the idle lanes to verify the execution on active lanes. Dual modular redundancy (DMR) has long been used for execution verification in CPUs. However, unlike traditional DMR, which adds a dedicated checker core for each core to be verified, Warped-DMR repurposes idle execution lanes for opportunistic execution verification. Hence, Warped-DMR needs zero extra execution lanes. Our results show that Warped-DMR can verify almost all instruction executions without significant performance or power overhead.
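The idea of reusing idle lanes as checkers can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the dissertation's hardware design: the function names (`pair_lanes`, `run_lane`, `execute_with_checking`), the first-come pairing of active lanes with idle lanes, and the single-bit fault model are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


def pair_lanes(active_mask: List[bool]) -> List[Tuple[int, int]]:
    """Pair each active lane with a distinct idle lane, while idle lanes last.

    Assumption: a simple in-order pairing; the abstract does not say how
    idle lanes are matched to active ones.
    """
    active = [lane for lane, on in enumerate(active_mask) if on]
    idle = [lane for lane, on in enumerate(active_mask) if not on]
    return list(zip(active, idle))


def run_lane(op: Callable[[int], int], operand: int, lane: int,
             faulty_lanes: FrozenSet[int]) -> int:
    """Execute op on one lane; a faulty lane flips the low bit of its result.

    Assumption: this bit flip is a toy fault model used only for the demo.
    """
    result = op(operand)
    return result ^ 1 if lane in faulty_lanes else result


def execute_with_checking(
    op: Callable[[int], int],
    operands: List[int],
    active_mask: List[bool],
    faulty_lanes: FrozenSet[int] = frozenset(),
) -> Tuple[Dict[int, int], List[int]]:
    """Run op on the active lanes and re-run it on paired idle lanes.

    Returns the active-lane results and the list of active lanes whose
    redundant execution on the checker lane disagreed.
    """
    results = {
        lane: run_lane(op, operands[lane], lane, faulty_lanes)
        for lane, on in enumerate(active_mask)
        if on
    }
    mismatches = [
        active
        for active, checker in pair_lanes(active_mask)
        if run_lane(op, operands[active], checker, faulty_lanes) != results[active]
    ]
    return results, mismatches
```

In this model, a four-lane warp with mask `[True, True, False, False]` lets lanes 2 and 3 re-execute the work of lanes 0 and 1, so a fault injected into lane 1 shows up as a mismatch. When every lane is active there are no idle lanes to borrow and `pair_lanes` returns an empty list; the sketch does not model how such fully utilized instructions are handled.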

Keywords/Search Tags:

Power, Execution, Register, Results show, GPU, Underutilization, Throughput, Lanes
