Exploiting Thread-Level Parallelism on Reconfigurable Architectures: a Cross-Layer Approach

Posted on: 2018-12-07
Degree: Ph.D
Type: Thesis
University: Northeastern University
Candidate: Momeni, Amir
Full Text: PDF
GTID: 2448390002950959
Subject: Computer Engineering
Abstract/Summary:
Field Programmable Gate Arrays (FPGAs) are one major class of architectures commonly used in parallel computing systems. FPGAs provide a massive number (i.e., millions) of programmable logic blocks and I/O cells, as well as programmable interconnects, which can be configured for a particular application. This reconfigurable architecture is flexible and power efficient, and can potentially deliver more floating-point operations per watt than general-purpose architectures such as CPUs and GPUs. However, programming an FPGA can be challenging and time-consuming, requiring hardware description language (HDL) experience and digital design expertise. High-level synthesis (HLS) tools have been designed to ease the FPGA programming task by generating HDL (e.g., Verilog or VHDL) code from high-level languages (e.g., C/C++, OpenCL). In particular, recent developments in OpenCL-based HLS tools (OpenCL-HLS) enable programmers to construct a customized data-path that best matches a parallel application, relieving the programmer of many implementation details.

The availability of OpenCL-HLS tools for FPGAs creates many new opportunities, and also presents new challenges in fully utilizing these capabilities. The primary challenge lies in the difference between the OpenCL parallelism semantics and the parallel execution model on FPGA devices. OpenCL was primarily developed for GPU devices, which have many spatially-parallel cores. We need to explore new classes of optimization in order to fully leverage OpenCL execution on FPGAs.

This thesis explores and addresses OpenCL-HLS challenges using three different approaches. In the first approach, we consider source-level optimization, where we evaluate the impact of OpenCL source-level decisions on the resulting data-path and FPGA execution efficiency. Our aim is to analyze the correlation between OpenCL parallelism semantics and parallel execution on FPGA devices, so that we can guide OpenCL programmers in developing optimized code for an FPGA. We study the impact of different grains (fine and coarse-grained) and forms (spatial and temporal) of parallelism exposed by OpenCL on the generated data-path. We also study the efficiency of the OpenCL Pipe semantic when targeting an FPGA.

In the second approach, called synthesis optimization, we introduce novel optimization techniques for the synthesis of OpenCL kernels targeting FPGA devices. We propose a Hardware Thread Reordering (HTR) technique to improve the performance of irregular kernels. The goal is to guide OpenCL-HLS tool developers to design a more efficient data-path for a given OpenCL kernel. Using our HTR technique, we achieve up to an 11x speed-up, with less than a 2x increase in resource utilization.

In our third approach, called the architectural approach, we propose a novel device named FP-GPU (field-programmable GPU), a new class of architecture that combines the benefits of both GPU and FPGA architectures. FP-GPU utilizes the GPU memory hierarchy, but introduces a novel thread switching mechanism, which helps to hide long memory latencies. The FP-GPU device includes reconfigurable fabric that can serve as an application-specific compute unit, maximizing the efficiency of OpenCL kernel execution. Our evaluation of FP-GPU finds that we can achieve up to a 4x speed-up while utilizing 88% fewer resources than a general-purpose GPU.
Keywords/Search Tags: FPGA, Parallel, Architectures, GPU, OpenCL, Reconfigurable, FPGAs