
Critical branches and lucky loads in Control-Independence architectures

Posted on: 2010-05-31
Degree: Ph.D.
Type: Dissertation
University: University of Illinois at Urbana-Champaign
Candidate: Malik, Kshitiz
Full Text: PDF
GTID: 1442390002483011
Subject: Engineering
Abstract/Summary:
Branch mispredictions have a first-order impact on the performance of integer applications. Control Independence (CI) architectures aim to overlap the penalties of mispredicted branches with useful execution by spawning control-independent work as separate threads. Although control independent, such threads may consume register and memory values produced by preceding threads. This dissertation presents an efficient and complexity-effective mechanism for synchronizing inter-thread register dependences in CI architectures. The mechanism is further extended to synchronize memory dependences by treating store-set IDs as analogous to architectural registers.

The performance of CI architectures is limited by mispredicted branches that have data-flow dependences crossing thread boundaries, called critical branches. Critical branches increase the average misprediction penalty suffered by CI architectures, reducing correct-path instruction bandwidth. I propose hardware mechanisms that alleviate the critical branch problem. First, I modify the register synchronization mechanism to remove false dependences that arise from saves and restores of callee-saved registers. Second, I find that the store-set predictor introduces dependences between loads and stores that alias quite infrequently; I modify the memory synchronization mechanism so that it strikes the right balance between synchronization and speculation at a per-load granularity. Finally, I show that other mechanisms, such as a critical-branch-aware spawn policy, can also alleviate the performance loss from critical branches. As a result of these optimizations, a four-core CI architecture attains a speedup of up to 90% over a single core, albeit with an aggressive and hard-to-implement memory backend that performs associative searches across the entire queue.

I find that a CI processor using the more implementable and widely accepted cache-coherence-based disambiguation and forwarding (CC-DF) suffers a severe slowdown. The CI processor also takes a large performance hit when using a recently proposed disambiguation mechanism called Bulk. In both cases, most of the performance loss can be attributed to a small set of instructions across which the memory synchronization mechanism deemed it profitable to speculate, called lucky loads. Because of lucky loads, the performance of CI processors is extremely sensitive to the mechanism used for inter-thread forwarding and disambiguation. With a conservative memory backend like CC-DF or Bulk, the adaptive memory synchronization mechanism is forced to become less speculative and synchronizes a larger fraction of loads, thereby reducing performance.

I perform a thorough analysis of the performance sensitivity of CI processors to disambiguation and forwarding. The insights from this analysis drive the design of low-complexity hardware mechanisms for these two functions that nevertheless attain high performance. The basic premise behind these mechanisms is to use small caches to perform early disambiguation and forwarding. These caches are not responsible for ensuring correctness; they merely enable high performance in the presence of lucky loads. The caches are backed by a simple load re-execution mechanism that guarantees correctness. I find that the performance of a CI processor with small structures for disambiguation and forwarding (32-entry and 128-entry, respectively) is within 10% of global load and store queues in the worst case.
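The per-load balance between synchronization and speculation described above can be pictured as a small table of saturating confidence counters indexed by load PC: loads that rarely alias with predecessor-thread stores are allowed to speculate, while loads that have been caught aliasing are forced to synchronize. The Python sketch below is only an illustrative model of that trade-off under assumed parameters (table size, counter width, threshold, and the PerLoadSyncPredictor name are all invented for the example); the dissertation's actual mechanism operates on store-set IDs and thread boundaries in hardware, and correctness there is guaranteed by load re-execution rather than by this predictor.

# Minimal sketch (assumed parameters, not the dissertation's design):
# a per-load saturating-counter predictor that decides whether a load
# should synchronize with predecessor-thread stores or speculate past
# them, trained by a re-execution-style check after the load resolves.

class PerLoadSyncPredictor:
    def __init__(self, entries=256, max_count=3, threshold=2):
        self.entries = entries              # number of predictor entries
        self.max_count = max_count          # saturating-counter ceiling
        self.threshold = threshold          # count >= threshold => speculate
        self.table = [max_count] * entries  # start optimistic: speculate

    def _index(self, load_pc):
        return load_pc % self.entries       # trivial PC hash for the sketch

    def should_speculate(self, load_pc):
        # True: issue the load past unresolved predecessor-thread stores.
        # False: synchronize (wait for the matching store-set stores).
        return self.table[self._index(load_pc)] >= self.threshold

    def update(self, load_pc, aliased):
        # Train once the load's value is verified (e.g., by re-execution).
        # `aliased` is True if an earlier-thread store actually overlapped.
        i = self._index(load_pc)
        if aliased:
            self.table[i] = 0               # mis-speculation is costly: back off hard
        else:
            self.table[i] = min(self.table[i] + 1, self.max_count)


# Tiny usage example with a made-up load PC:
pred = PerLoadSyncPredictor()
pc = 0x400AB0
print(pred.should_speculate(pc))    # True: default is to speculate
pred.update(pc, aliased=True)       # verification found a real dependence
print(pred.should_speculate(pc))    # False: this load now synchronizes

The sketch also makes the lucky-load sensitivity concrete: with a conservative backend such as CC-DF or Bulk, an adaptive scheme of this kind is pushed toward the synchronize side for more loads, which is exactly the performance loss the abstract attributes to lucky loads.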
Keywords/Search Tags: Critical branches, lucky loads, CI architectures, performance, CI processor, disambiguation and forwarding, memory synchronization mechanism