Research For The Implementation And Optimization Technology Of Typical Image Processing Algorithms On Xeon Phi

Posted on:2014-09-01

Degree:Master

Type:Thesis

Country:China

Candidate:J Qi

Full Text:PDF

GTID:2308330479479109

Subject:Computer Science and Technology

Abstract/Summary:

With the rise of heterogeneous systems, the high performance computing (HPC) domain has developed rapidly. Heterogeneous systems based on CPU+GPU are widely applied in fields such as bioinformatics, medical imaging, and computational fluid dynamics (CFD). However, CPUs and GPUs use different instruction sets and programming models, which makes programming and optimizing an application more demanding. In 2012 Intel therefore introduced the Xeon Phi coprocessor, based on the Many Integrated Core (MIC) architecture, which eases programming by inheriting the traditional programming models and characteristics of x86. In addition, Xeon Phi integrates more than 50 lightweight x86 cores. Each core supports 4 hardware threads and contains a 512-bit-wide SIMD Vector Processing Unit (VPU). Xeon Phi therefore provides powerful parallel processing capability. However, research on optimizing algorithms for Xeon Phi is still in its infancy.

In this thesis, we study how to implement and accelerate two typical image processing algorithms on the Xeon Phi platform. Image processing algorithms demand high performance because of their large data volumes and strict real-time requirements. We select two representative algorithms, the 2D IDCT algorithm and the 3D GVF field algorithm, as our case studies on Xeon Phi. Our main contributions are as follows:

(1) Porting the 2D IDCT algorithm to Xeon Phi and optimizing it on that platform. First, we implement a serial version of the algorithm using the row-column separation method; its performance serves as the baseline for the optimized implementations. Then, we extend the serial implementation to multiple threads with the OpenMP multithreading standard and vectorize it with the 512-bit SIMD intrinsics provided by Intel. Finally, we further optimize the multithreaded, vectorized implementation with data pre-fetching.
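The row-column separation and thread extension described for the 2D IDCT can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: it assumes 8×8 blocks stored contiguously in row-major order and an orthonormal DCT scaling, the function names (`cos16`, `idct_1d`, `idct_2d_block`, `idct_2d_image`) are hypothetical, and it uses only a plain OpenMP loop, leaving out the 512-bit intrinsics and the pre-fetching step.

```c
#include <assert.h>
#include <stddef.h>

#define N 8   /* block edge; 8x8 blocks are an assumption of this sketch */

/* cos(k*pi/16) for k = 0..8; all other angles follow by symmetry. */
static const float COS16[9] = {
    1.0f, 0.98078528f, 0.92387953f, 0.83146961f, 0.70710678f,
    0.55557023f, 0.38268343f, 0.19509032f, 0.0f
};

/* Returns cos(k*pi/16) for any k >= 0 (the cosine has period 32 in k). */
static float cos16(int k)
{
    k %= 32;
    if (k > 16) k = 32 - k;                      /* cos(2*pi - t) =  cos(t) */
    return (k <= 8) ? COS16[k] : -COS16[16 - k]; /* cos(pi - t)   = -cos(t) */
}

/* 1D inverse DCT of N samples. The stride lets the same routine walk
   a row (stride 1) or a column (stride N) of a row-major block. */
static void idct_1d(const float *in, float *out, int stride)
{
    for (int x = 0; x < N; ++x) {
        float sum = 0.35355339f * in[0];         /* u = 0 term, scale sqrt(1/8) */
        for (int u = 1; u < N; ++u)              /* u > 0 terms, scale sqrt(2/8) */
            sum += 0.5f * in[u * stride] * cos16((2 * x + 1) * u);
        out[x * stride] = sum;
    }
}

/* 2D IDCT of one block by row-column separation:
   a 1D IDCT over every row, then a 1D IDCT over every column. */
void idct_2d_block(const float *coef, float *pixels)
{
    float tmp[N * N];
    for (int r = 0; r < N; ++r)
        idct_1d(coef + r * N, tmp + r * N, 1);
    for (int c = 0; c < N; ++c)
        idct_1d(tmp + c, pixels + c, N);
}

/* Blocks are independent, so they can be spread across threads.
   Without OpenMP enabled the pragma is ignored and the loop runs serially. */
void idct_2d_image(const float *coef, float *pixels, int num_blocks)
{
    assert(num_blocks >= 0);
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < num_blocks; ++b)
        idct_2d_block(coef + (size_t)b * N * N, pixels + (size_t)b * N * N);
}
```

Calling `idct_2d_image(coef, pixels, num_blocks)` on an array of coefficient blocks writes the reconstructed pixel blocks to `pixels`. Parallelizing over whole blocks keeps each thread's working set (one 64-float block plus a temporary) small and avoids any sharing between threads.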
For the 2D IDCT, tests show that vectorization yields a 5.82× speedup for single-precision images compared with the non-vectorized implementation, and that performance scales nearly linearly as the number of threads increases; in addition, data pre-fetching improves performance by about 24%. Combining all of these optimizations, the best performance of the algorithm on Xeon Phi is about 1.53 times that achieved on one E5-2670 CPU.

(2) Porting the 3D GVF field algorithm to Xeon Phi and optimizing it on that platform. In addition to discussing general optimizations such as vectorization and thread extension, we focus on how optimizations for stencil computation affect the algorithm's performance. We design an efficient loop tiling strategy that improves cache utilization and thereby reduces the performance loss. Tests show that the 3D GVF field computation for double-precision images gains a clear performance improvement: with the loop tiling strategy proposed in this thesis, the algorithm achieves its best performance on Xeon Phi, with speedups of 1.78× and 2.77× for problem sizes of 256×256×256 and 512×512×512 respectively, compared with the best performance achieved on one E5-2670 CPU.

(3) Summarizing the optimization rules for image processing algorithms on Xeon Phi and deriving techniques that can guide the optimization of other image processing algorithms. In general, for compute-intensive algorithms, good performance can be obtained directly through basic optimization techniques such as vectorization and thread extension; for algorithms with a low computation-to-memory-access ratio, increasing cache utilization should be considered first, and the loop tiling method proposed in this thesis serves this purpose well.
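To make the loop tiling idea for stencil computation concrete, here is a generic sketch in C. It is not the tiling strategy proposed in the thesis, whose details the abstract does not give: it assumes a 7-point Laplacian stencil (the kind of neighbor access a GVF iteration relies on), a row-major grid with x as the fastest dimension, and hypothetical names (`IDX`, `laplacian_tiled`) and tile parameters (`ty`, `tz`).

```c
#include <assert.h>
#include <stddef.h>

/* Linear index of (x, y, z) in a row-major nx * ny * nz grid. */
#define IDX(x, y, z, nx, ny) \
    ((size_t)(z) * (size_t)(ny) * (size_t)(nx) + (size_t)(y) * (size_t)(nx) + (size_t)(x))

static int min_int(int a, int b) { return a < b ? a : b; }

/* One sweep of a 7-point Laplacian over the interior points, tiled in y and z.
   Each (ty x tz) tile reuses neighboring planes and rows while they are still
   in cache. The x loop is left untiled so that accesses stay unit-stride and
   the compiler can vectorize it. Boundary points of out are not written. */
void laplacian_tiled(const double *in, double *out,
                     int nx, int ny, int nz, int ty, int tz)
{
    assert(ty > 0 && tz > 0);
    const size_t row = (size_t)nx;                /* offset to y +/- 1 */
    const size_t plane = (size_t)nx * (size_t)ny; /* offset to z +/- 1 */

    /* Tiles are independent, so they can be shared among threads.
       Without OpenMP enabled the pragma is ignored and the loops run serially. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int zz = 1; zz < nz - 1; zz += tz)
        for (int yy = 1; yy < ny - 1; yy += ty) {
            const int zend = min_int(zz + tz, nz - 1);
            const int yend = min_int(yy + ty, ny - 1);
            for (int z = zz; z < zend; ++z)
                for (int y = yy; y < yend; ++y)
                    for (int x = 1; x < nx - 1; ++x) {
                        const size_t c = IDX(x, y, z, nx, ny);
                        out[c] = in[c - 1] + in[c + 1]
                               + in[c - row] + in[c + row]
                               + in[c - plane] + in[c + plane]
                               - 6.0 * in[c];
                    }
        }
}
```

Suitable values for `ty` and `tz` depend on the cache size per core and the number of threads sharing it, so in practice they would be tuned empirically rather than fixed in advance.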

Keywords/Search Tags:

Xeon Phi, IDCT, 3D GVF field vector, vectorizing, thread-extending, data pre-fetching, loop tiling

Related items

1	Research On Data Pre-fetching Techniques For Loop-level Array References
2	Research On Graph Calculation Based On Xeon Phi Coprocessor
3	The Research On Program Optimization Techniques Of Embedded Image Compression System
4	Research On Loop Optimization Based On Polyhedral Model
5	Study On Server-Side Web Pre-fetching Based On Data Mining
6	Research On Low Power Techniques Of The Instruction Fetching Unit In Embedded Processors
7	User Online Behavior Vectorizing Model And Its Application
8	Field Extending Technique Based On Wave-front Coding Theory
9	The Research And Implementation Of The Key Techniques On Single Chip Multiprocessors
10	Many-thread Wide Vector Modeling And Performance Analysis