Font Size: a A A

Research On Key Techniques And Applications Of Accelerated Monte Carlo Simulation With FPGA

Posted on:2014-09-17Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y LiFull Text:PDF
GTID:1220330479979517Subject:Electronic Science and Technology
Abstract/Summary:PDF Full Text Request
The Monte Carlo method is widely used in a large number of scientifical applications, including molecular physics simulation, financial modeling and biomedical simulation. With rapid development of science, the practical problems become increasingly complicated, causing the Monte Carlo computations to become more and more time-consuming. Thus accelerating Monte Carlo simulation is becoming an increasingly important issue in designing modern systems. Field-Programmable Gate Arrays(FPGAs) have advantages over CPUs and GPUs in terms of reconfigurable, bit-wise parallelization, hign performance and low power consumption. Thus the use of FPGAs is becoming a promising solution for accelerating Monte Carlo simulations. However, there are some challenges to overcome: the diversity of algorithms, parallel algorithm design and optimal hardware structure customization. Exsisting FPGA structures consume considerable resources, have limitations in the parallel structure, and most of them are customized for specific algorithms.This thesis emphasizes accelerating Monte Carlo computations using FPGA-based techniques. In summary, this thesis makes the following contributions:1. We propose an FPGA-based architecture for efficient implementation of the well equidistributed long-period linear(WELL) algorithm. This structure is capable of generating high quality, long period uniform random numbers with throughput of 1 sample per cycle. A dedicated 6R/2W RAM structure is also proposed. Utilizing a BRAM-and-register-hybrid structure, the RAM is capable of providing 6 reads and 2 writes concurrently in a single clock cycle and is the key component for the entire system to achieve the expected throughput. Experimental results show that our design outperforms general-purpose processors and previous work. To the best of our knowledge, no hardware implementations for WELL method exist in the literature.2. We develop an automatic word-length determination tool(SATRANS) based on the Simulated Annealing algorithm. SATRANS can transform the system from Floating-Point model to Fixed-Point model and provide a series of word-length solutions that form a tradeoff balance between hardware complexity and signal quality. We adopt the basic 64-bit long integer data type of the C language with corresponding add, subtract and multiply as well as shift and mask operations to simulate Fixed-Point computation, which greatly accelerates the execution of the word-length optimization process. We demonstrate the use of SATRANS to find word-length for an infinite impulse response(IIR) filter. Experimental results show that SATRANS can provide better word-length solution in comparison to the traditional search methods based on the Greedy Strategy. SATRANS is also successfully used in the Gaussian random number Generation systems and the Credit Default Swap(CDS) pricing system.3. An FPGA-based long period Gaussian random number generation framework is presented. By fully studying the common properties of different Gaussian variable generation algorithms, a common design flow for developing FPGA-based Gaussian Random Number Generator(GRNG) is proposed. The design flow is successfully used for developing GRNGs based on the Box Muller and the Monty Python methods, respectively. For the Box Muller design, we adopt a piecewise polynomial to approximate elementary functions such as sin(x)/cos(x), ln(x) and sqrt(x). SATRANS is also used for word-length optimization. Experimantal results show that our Box Muller design is capable of producing 2 samples per cycle. The performance is 12.5 times faster than a dedicated software version running on a 2.67 GHz Intel Core i5 processor. It outperforms previous work in period and throughput-area ratio, and is superior to Floating-Point implementations in both performance and resource usage. For the Monty Python design, to improve the throughput, we devise a four-phase architecture, which contains a fully pipelined structure to handle the main path as well as an iterative dedicated unit to deal with the tail region. These two structures work in parallel. Experimantal results show that our Monty Python design achieves a throughput of almost 1 sample per cycle. The performance is 24.8 times faster than an Intel Core i5 general-purpose processor. It has longer period and better performance compared to previous work and better performance-resource usage ratio compared to Floating-Point implementations.4. We propose a software/hardware framework for generating uniform random numbers in parallel. Using the Fast Jump Ahead technique, the software can produce initial states for each generator to guarantee independence of different substreams. With support from the software, the hardware structure can be easily constructed by simply replicating the single generator. To improve the performance of the software, an algorithm to efficiently derive the characteristic polynomial cp(z) of F2-Linear is also proposed. Experimental results show that this framework can obtain speedup roughly proportional to the number of parallel cores. Statistical tests of interleaved sequences are performed to check for correlations between different substreams of the parallel framework. Furthermore, we implement 149 parallel instances of WELL19937 generators on a Xilinx Virtex-5 FPGA device. It runs as fast as 279 MHz. Compared to CPU and GPU implementations, the throughput is 9.8 and 2.5 times faster, respectively, while the throughput-power efficiency achieves 194.9 and 21.1 times speedup, respectively. We also apply our framework to two practical applications: a Monte-Carlo simulation for estimating π and a Gaussian variable generator. Experimental results verify the correctness of our framework as well as the superior characteristics of the WELL algorithm.5. A software/hardware parallel framework for accelerating a Monte-Carlo based CDS pricing model is presented. Tasks are divided between software and hardware as well as different simulation cores to perform in parallel. The latencies of the data transfer between software and hardware are hidden in the computation in order to improve the performance-resource cost ratio. The framework has very good scalability and the accuracy of the hardware is verified by comparing with the standard software. An implementation of the structure with 30 simulation cores achieves 358-fold speed up compared to the standard software running on a 2.67 GHz Intel Core i5 processor.
Keywords/Search Tags:Monte-Carlo computation, FPGA Accelerating, uniform random number, Gaussian random number, parallel random number generator, word-length optimization, CDS pricing model
PDF Full Text Request
Related items