Research On The Hardware Acceleration For High-precision Algorithm Based-on Very Long Instruction Word Framework

Posted on:2013-05-30

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y W Lei

GTID:1268330392473818

Subject:Computer Science and Technology

Abstract/Summary:

Scientific computing has become the third mode of scientific discovery, alongside theory and experiment. Most scientific applications operate on floating-point arithmetic, in which rounding error is unavoidable, and the accumulation of rounding errors leads to inaccurate, unreliable, and even wrong results. Many scientific applications therefore rely on high-precision arithmetic. However, high-precision arithmetic performs very poorly on general-purpose processors, because it is mostly implemented by software emulation on top of fixed-precision operations such as 64-bit floating-point. Field-Programmable Gate Arrays (FPGAs) have advantages over CPUs in customizability, reconfigurability, performance, and power consumption, so FPGA-based accelerators have become a promising approach for speeding up scientific applications. In this thesis, we implement high-precision floating-point arithmetic on FPGAs to explore the capability and flexibility of FPGA solutions for accelerating high-precision scientific applications. In summary, this thesis makes the following contributions:

(1) We propose a parameterizable Very Long Instruction Word (VLIW) framework on FPGAs, which features low hardware complexity, high performance, and high scalability. Based on this framework, a hardware accelerator with multiple VLIW kernels is presented to exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) in high-precision applications simultaneously. To solve the code density problem in VLIW implementations, we propose a multi-level index code compression scheme for the custom VLIW framework on FPGAs. For each unit, a flag indicates whether the unit is used, and a RAM is built to store the operations that are used. This scheme avoids the uncertain VLIW instruction length of traditional code compression methods and fully avoids explicit no-ops.

(2) We propose an exact vector inner product algorithm and structure (Quad-HPMAC) for IEEE 754-2008 quadruple-precision floating-point arithmetic. A very long fixed-point register stores the summation without information loss, and exact fixed-point operations, instead of floating-point operations, are used to obtain exact results. Several schemes, such as a two-level RAM bank structure for summation, a partial summation scheme, and a carry-save accumulation scheme, are introduced to improve the frequency and throughput of the Quad-HPMAC unit. Finally, a prototype of the unified matrix accelerator, equipped with 4 Quad-HPMAC units, is presented to implement typical quadruple-precision matrix computation algorithms, such as matrix multiplication, LU decomposition, and MGS-QR decomposition. Experimental results show that our design outperforms general-purpose processors in precision, performance, and power consumption.

(3) We propose a special-purpose processor (QP_VELP) based on the custom VLIW framework, which uses unified hardware to efficiently evaluate various quadruple-precision elementary functions. This processor is well matched to the characteristics of elementary functions in scientific applications: high implementation complexity, low frequency of use, and high latency. A pipelined implementation of polynomial approximation with the Estrin scheme is used to enhance ILP. The performance of QP_VELP is further improved through loop unrolling and the explicit parallelism of VLIW instructions. Compared to related work, our design achieves higher precision and lower latency with less resource consumption. Moreover, our solution for elementary functions achieves high resource utilization.

(4) Taking the orbit prediction algorithms for space objects (SGP4/SDP4) as an example, we present a VLIW-based architecture for quadruple-precision scientific applications. The QP_VELP unit is integrated into this accelerator to implement the various elementary functions in SGP4/SDP4 with unified hardware. Multiple basic quadruple-precision operation units in this accelerator can execute in parallel to exploit the ILP and TLP in SGP4/SDP4. We also propose a greedy algorithm that schedules the operations in the data flow graph of the SGP4/SDP4 algorithm into custom VLIW instructions and generates a VLIW instruction sequence with few no-ops. Experimental results show that our VLIW-based accelerator offers speedup and power advantages over a general-purpose processor.

(5) We extend the concepts, research methods, and implementation schemes from the design of the quadruple-precision algorithm accelerator to arbitrary-precision arithmetic systems. First, we address the exact vector inner product structure (VPMAC) for arbitrary-precision floating-point arithmetic, which uses exact fixed-point operations to avoid introducing rounding errors. Then, we address the processor (VP_VELP), based on the custom VLIW framework, for arbitrary-precision elementary functions. The performance of VP_VELP is improved through the explicit parallelism of VLIW instructions and by dynamically varying the precision of intermediate computations. Finally, two schemes, the VPMAC coprocessor and the unified matrix accelerator (VPMATA), are presented to accelerate typical arbitrary-precision matrix computation algorithms. Experimental results show that the VPMATA, equipped with 8 VPMAC units and 1 VP_VELP unit, achieves 13X-63X better performance.
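The flag-per-unit idea in contribution (1) can be illustrated with a small software model. This is only one possible reading of the abstract, not the thesis's multi-level index scheme: the names `compress`, `decompress`, and `unit_ram` are hypothetical, a Python list stands in for each unit's RAM, and `None` marks a no-op slot.

```python
def compress(bundles):
    """Split VLIW bundles into fixed-width flag words plus per-unit operation stores.

    bundles: list of equal-length lists; entry u is the operation for
    functional unit u, or None when that unit idles (a no-op).
    """
    n_units = len(bundles[0])
    flags = []                                # one flag word per bundle
    unit_ram = [[] for _ in range(n_units)]   # hypothetical per-unit RAM
    for bundle in bundles:
        word = 0
        for u, op in enumerate(bundle):
            if op is not None:
                word |= 1 << u                # bit u set: unit u is used
                unit_ram[u].append(op)        # only used operations are stored
        flags.append(word)
    return flags, unit_ram


def decompress(flags, unit_ram):
    """Rebuild the original bundles from flag words and per-unit stores."""
    n_units = len(unit_ram)
    next_op = [0] * n_units                   # read pointer into each unit's store
    bundles = []
    for word in flags:
        bundle = []
        for u in range(n_units):
            if (word >> u) & 1:
                bundle.append(unit_ram[u][next_op[u]])
                next_op[u] += 1
            else:
                bundle.append(None)           # no-op is implied, never stored
        bundles.append(bundle)
    return bundles


program = [["fmul", None, "load"],
           [None, None, "load"],
           ["fmul", "fadd", None]]
flags, unit_ram = compress(program)
```

In this model every compressed instruction is a flag word of the same width, and no-ops occupy no storage, which is the property the abstract attributes to the scheme.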
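The exact inner product in contribution (2) accumulates products in a very long fixed-point register and rounds only once. A minimal software sketch of that principle follows. It assumes 64-bit doubles rather than quadruple precision, uses a Python integer in place of the hardware register, and does not model the RAM bank or carry-save schemes; `exact_dot` and `naive_dot` are illustrative names, and inputs are assumed finite.

```python
FRAC_BITS = 2 * 1074  # any product of two doubles is a multiple of 2**-2148


def exact_dot(xs, ys):
    """Inner product of two float sequences with a single final rounding."""
    acc = 0  # Python int stands in for the long fixed-point register
    for x, y in zip(xs, ys):
        nx, dx = x.as_integer_ratio()   # x == nx / dx exactly; dx is a power of two
        ny, dy = y.as_integer_ratio()
        shift = FRAC_BITS - (dx.bit_length() - 1) - (dy.bit_length() - 1)
        acc += (nx * ny) << shift       # exact fixed-point multiply-accumulate
    return acc / (1 << FRAC_BITS)       # integer true division rounds once


def naive_dot(xs, ys):
    """Ordinary floating-point loop: rounds after every multiply and add."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
```

With these inputs the naive loop loses the middle term to rounding and returns 0.0, while the fixed-point accumulation returns 1.0.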
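The Estrin scheme named in contribution (3) evaluates a polynomial by combining coefficient pairs level by level, so the operations within a level are independent and can be issued in parallel. The sketch below shows the general technique in plain Python; it is not the QP_VELP pipeline, and the function name `estrin` is an illustrative choice.

```python
def estrin(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) using Estrin's scheme.

    coeffs are ordered from the constant term upward.
    """
    terms = list(coeffs)
    if not terms:
        return 0
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0)  # pad so every term has a partner
        # Each pair (a, b) -> a + b * power is independent of the others,
        # which is where the instruction-level parallelism comes from.
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power  # x, x**2, x**4, ...
    return terms[0]
```

For a polynomial with n coefficients this takes about log2(n) dependent levels, compared with n - 1 dependent steps for Horner's rule, at the cost of the extra squarings of x.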
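The greedy scheduling step in contribution (4) can be sketched as a simple list scheduler that packs ready operations from a data flow graph into VLIW bundles. The abstract does not give the thesis's priority rule or operation latencies, so this sketch assumes single-cycle operations and uses program order as the priority; `greedy_schedule`, the unit names, and the example graph are all hypothetical.

```python
def greedy_schedule(ops, deps, slots):
    """Pack data-flow-graph operations into VLIW bundles greedily.

    ops:   {op_name: unit_type}, in priority order
    deps:  {op_name: [names of operations it depends on]}
    slots: {unit_type: number of such units per bundle}
    Assumes every operation completes in one cycle.
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        free = dict(slots)  # functional units still free in this cycle
        bundle = []
        for op in remaining:
            unit = ops[op]
            ready = all(p in done for p in deps.get(op, []))
            if ready and free.get(unit, 0) > 0:
                free[unit] -= 1
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic dependency or missing functional unit")
        done.update(bundle)  # results become visible in the next cycle
        remaining = [op for op in remaining if op not in done]
        bundles.append(bundle)
    return bundles


ops = {"m1": "mul", "m2": "mul", "a1": "add",
       "m3": "mul", "a2": "add", "m4": "mul"}
deps = {"a1": ["m1", "m2"], "m3": ["a1"], "a2": ["m3", "m1"]}
schedule = greedy_schedule(ops, deps, {"mul": 2, "add": 1})
```

Here the independent multiply `m4` cannot issue in the first cycle because both multipliers are taken, so the scheduler places it beside `a1` in the second bundle instead of leaving that slot as a no-op.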

Keywords/Search Tags:

high-precision arithmetic, reconfigurable computing, Very Long Instruction Word, vector inner product, elementary function

Related items

1	Research On Several Frequently-used Algorithms And Their Implementation For Reconfigurable System
2	Research On High-Performance Arithmetic For Floating-point Division And The Elementary Functions
3	Variable long-precision arithmetic (VLPA) for reconfigurable coprocessor architectures
4	Key Compilation Techniques For High Productivity Computing: Precision, Performance And Power Consumption
5	Function based heuristics to develop reconfigurable and multifunctional products
6	Design Of Support Vector Machine Accelerator Based On Reconfigurable Computing Platform
7	High-efficiency Reconfigurable Array Computing: Architecture, Methodology And Application Mapping Technology
8	Study On Arithmetic P System Based On DNA Computing
9	High-performance arithmetic for division and the elementary functions
10	Design And Implementation Of Lightweight Hash Function Reconfigurable Architecture