Accelerating Parallel Micro-Architectural Simulation Through Sampling Methodology

Posted on: 2017-02-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C T Jiang
GTID: 1318330485450835
Subject: Computer system architecture
Abstract/Summary:
Micro-architectural simulation plays a significant role in computer architecture design, both in academia and in industry. Computer architects rely heavily on simulation to explore the design space. Unfortunately, simulation speed has been a major challenge for decades. In the multi-core era, the problem is further exacerbated because (i) a larger design space of multi-core systems needs to be explored, and (ii) more complicated and longer-running multi-threaded benchmarks need to be simulated to better evaluate and stress multi-core systems. Therefore, speeding up multi-core micro-architectural simulation is of crucial importance.
Sampling is a well-known and widely used simulation acceleration technique that can dramatically improve simulation speed by simulating only a limited number of sampling units, from which overall application performance is estimated. Sampled simulation of single-threaded applications on single-core processors is a mature technology. It defines sampling units by instruction count: a sampling unit is a sequence of the dynamic instruction stream that serves as a unit of work, hence the name Instruction-Based Sampling (IBS). However, IBS does not work well for sampled simulation of multi-threaded applications on multi-core systems, because the dynamic instruction stream may vary across runs due to non-determinism and synchronization activity. In other words, for multi-threaded applications a unit of work can no longer be defined by instruction count, because the assumption of a fixed dynamic instruction stream is broken. Therefore, a new sampling scheme named Time-Based Sampling (TBS) is proposed, which selects sampling units based on execution time in order to estimate a multi-threaded application's total running time.
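As a rough illustration of the idea (a minimal sketch, not the dissertation's implementation), the Python code below extrapolates total execution time from a run that alternates between detailed sampling units and skipped (fast-forwarded) intervals. The Interval record, the single aggregate IPC, and the rule of reusing the IPC of the most recent detailed unit for the next skipped interval are all assumptions made for illustration; an actual TBS simulator would track this per thread and model synchronization.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interval:
    """One stretch of a sampled run (hypothetical bookkeeping record)."""
    kind: str            # "detailed" or "fast_forward"
    instructions: int    # instructions executed during the interval
    cycles: float = 0.0  # measured cycles; only used for detailed intervals


def estimate_total_cycles(intervals: List[Interval]) -> float:
    """Estimate whole-run cycles from detailed and fast-forwarded intervals.

    Detailed intervals contribute their measured cycles. Fast-forwarded
    intervals contribute instructions / predicted IPC, where the predicted
    IPC is simply the IPC of the most recent detailed interval (an assumed
    predictor, chosen for simplicity).
    """
    total = 0.0
    last_ipc: Optional[float] = None
    for iv in intervals:
        if iv.kind == "detailed":
            if iv.cycles <= 0:
                raise ValueError("detailed interval needs a positive cycle count")
            total += iv.cycles
            last_ipc = iv.instructions / iv.cycles
        elif iv.kind == "fast_forward":
            if not last_ipc:
                raise ValueError("need a detailed interval with non-zero IPC first")
            total += iv.instructions / last_ipc
        else:
            raise ValueError(f"unknown interval kind: {iv.kind}")
    return total
```

Because most instructions fall into the skipped intervals, the quality of the predicted IPC in this sketch dominates the final estimate.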
Due to the complexity of multi-threaded workloads and multi-core systems, TBS is more challenging and complicated than IBS in several respects, including sample selection, the simulation of synchronization, and warmup. This dissertation makes a comprehensive study of TBS and addresses its main challenges.
First, we propose a novel sampling methodology based on the fractal behavior of multi-threaded applications, which solves the challenging sampling unit selection problem in TBS. The new sampling technique, PCantorSim, avoids a complicated preprocessing procedure and can be applied to a wide range of workloads. Specifically, we show that, besides phase behavior, multi-threaded applications also exhibit fractal behavior over their execution time; in other words, program behavior is self-similar at different time scales. Based on this observation, PCantorSim is able to select representative sampling units quickly and accurately. Running the PARSEC benchmarks on a simulated 8-core system, PCantorSim increases simulation speed over detailed parallel simulation by a factor of 20, with an average absolute execution time prediction error of 5.3%.
Second, we revisit and comprehensively analyze prior TBS approaches and obtain a number of novel and surprising insights: (1) accurately estimating the fast-forwarding IPC (instructions per cycle) is more important than accurately estimating the sample IPC; (2) fast-forwarding IPC estimation accuracy is determined by both the distribution of the sampling units and the way the sampling units are used to predict the fast-forwarding IPC; (3) fractal-based sampling is more accurate at small sampling unit sizes, whereas periodic sampling is more accurate at large sampling unit sizes; and (4) random sampling is inappropriate for TBS. These insights lead to the development of Two-level Hybrid Sampling (THS). THS achieves an average absolute execution time prediction error of 4% while yielding an average simulation speedup of 40× compared to detailed simulation. Case studies illustrate that THS accurately predicts relative performance differences across the design space.
Last, we propose the SOL (Shorter On-Line) warmup strategy, which significantly reduces the warmup cost in TBS. SOL extends and combines Prime warmup and the NSL (No-State-Loss) technique to substantially improve simulation speed while providing almost perfect warmup results. We explore different Prime strategies and different NSL configurations for SOL, and determine the parameters that provide a good trade-off between performance prediction accuracy and simulation speed. The experimental results show that SOL can be used with different TBS sampling methodologies, maintaining simulation accuracy while improving simulation speed.
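The two figures of merit quoted above, average absolute execution time prediction error and simulation speedup, can be computed as sketched below. The function names and the choice of averaging the relative error per benchmark are assumptions about how such numbers are conventionally reported, not details taken from the dissertation.

```python
from typing import Sequence


def mean_abs_pct_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average of |predicted - actual| / actual over all benchmarks, in percent.

    `predicted` holds execution times estimated by sampled simulation and
    `actual` holds the times from full detailed simulation (same order).
    """
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rel_errors = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return 100.0 * sum(rel_errors) / len(rel_errors)


def simulation_speedup(detailed_wallclock_s: float, sampled_wallclock_s: float) -> float:
    """How many times faster the sampled simulation finished (host wall-clock)."""
    if sampled_wallclock_s <= 0:
        raise ValueError("sampled wall-clock time must be positive")
    return detailed_wallclock_s / sampled_wallclock_s
```

Taking the absolute value before averaging keeps over- and under-predictions on different benchmarks from cancelling each other out.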
Keywords/Search Tags: Micro-architectural Simulation, Multi-core Processor, Multi-threaded Applications, Sampling, Fractal Theory, Warmup, Performance Evaluation