Research On Multi-threaded Compilation Optimization Techniques For Master-slave Hybrid Architecture Of CPU

Posted on:2022-10-04

Degree:Doctor

Type:Dissertation

Country:China

Candidate:K Nie


GTID:1488306731498134

Subject:Software engineering

Abstract/Summary:


Supercomputers play a major role in industrial manufacturing, disease research, natural disaster prediction, aerospace, and other fields. They are regarded as "national weapons" and represent a technological high ground that countries around the world compete for. After long-term, sustained effort by researchers, China has made great progress in domestic hardware: the parallel scale of domestic high-performance processors keeps growing, and their peak computing power keeps rising. To exploit this hardware fully, compilation optimization is needed to generate efficient code that matches domestic processors. Based on the basic language compiler (SWGCC) of the domestic "Sunway TaihuLight" supercomputer, this thesis studies multi-threaded parallel compilation optimization techniques for the Sunway master-slave hybrid architecture processor. To improve the parallel efficiency of the automatic thread-level parallelization system on the Sunway platform, it studies methods for recognizing thread-level parallelism and for generating efficient code. The main contributions are as follows.

(1) A thread-level parallelism mining algorithm for multi-level nested loops is proposed. When the hot code segment of a program contains multi-level nested loops, selecting the smallest set of loop levels that covers all dependences, so as to obtain the largest parallel granularity, is an NP-complete problem. This thesis uses a genetic algorithm to solve this parallel loop selection problem. First, the problem is described formally, and the forms of its feasible solutions, constraints, and optimal solution are given. Then the components of the genetic algorithm are designed: the chromosome representation and the fitness function, the initial population generation algorithm and the chromosome improvement method, the implementation of the genetic operators (crossover, mutation, and selection), the method for generating each new population, and the stopping criteria. Finally, the concrete steps for solving the problem with the genetic algorithm are given.

(2) A compressed thread-level alignment code generation algorithm is proposed. The loop alignment code generation algorithm in the original automatic thread-level parallelization system produces code that contains redundant conditional branch statements and extra loop iterations, which lowers the execution efficiency of the parallel code. To solve this problem, this thesis reclassifies the statements in a loop that need to be aligned and proposes a compressed alignment code generation algorithm that eliminates the redundant conditional tests and the overhead of the extra loop iterations.

(3) Two methods for improving the efficiency of thread-level parallel code are proposed. The automatic thread-level parallelization system on the Sunway platform implements parallelism with the fork-join model of the OpenMP standard. When a program contains many parallel regions, thread groups are created and terminated frequently, and the extra overhead is large. This thesis proposes a parallel region reconstruction optimization that merges multiple small parallel regions in the fork-join model into one large parallel region, which reduces the management and control overhead of the thread group and improves the efficiency of the thread-level parallel code. In addition, thread-private variables are implemented with thread-local storage, which depends on the operating system, the runtime library, and the compiler, and therefore has serious portability problems. Accesses to thread-private variables must also go through an interface provided by the operating system kernel, so they are slow and have a significant impact on application performance. To address these shortcomings, this thesis proposes a thread-private variable access technique based on privileged instructions of the Sunway processor. It reads the start address of the thread-private variables directly, without entering the operating system kernel, which speeds up access to thread-private variables and avoids the portability problems caused by version upgrades of the platform environment.

(4) A thread-level parallel scheduling strategy for the Sunway platform is proposed. The default loop scheduling method in the automatic thread-level parallelization system of the Sunway platform is dynamic scheduling. Its main advantage is load balancing, but in many cases its larger scheduling overhead outweighs the benefit of parallelism. Static scheduling, by contrast, usually cannot guarantee load balance. This thesis proposes a scheduling method that combines static and dynamic scheduling to reduce the impact of scheduling overhead while maintaining load balance. In addition, considering the effect of scheduling parameter selection on data locality and software pipelining, a constraint matrix for selecting the scheduling parameters is given.

All the methods and techniques in this thesis have been implemented in the automatic thread-level parallelization system of the Sunway platform and evaluated with the SPEC CPU2006 and NPB3.3.1-SER benchmarks on the "Sunway TaihuLight" supercomputer. Performance improvements of 8% and 11% were achieved, and the experimental results verify the correctness and effectiveness of the proposed methods.
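The trade-off described in item (4) can be illustrated with a small simulation. This is only a toy sketch under stated assumptions, not the scheduling algorithm from the thesis: the function name `hybrid_schedule`, the parameters `static_fraction` and `chunk`, and the per-iteration cost model are all hypothetical, and the constraint matrix for parameter selection is not modeled. A leading share of the iterations is split statically into contiguous blocks, and the remainder is handed out in chunks to whichever simulated thread is least loaded, with the number of hand-outs standing in for scheduling overhead.

```python
def hybrid_schedule(costs, num_threads, static_fraction=0.5, chunk=2):
    """Toy model of a combined static/dynamic loop schedule.

    costs[i] is an assumed execution cost of iteration i.
    Returns (assignment, loads, dispatches), where dispatches counts the
    dynamic hand-outs and serves as a proxy for scheduling overhead.
    All names and parameters here are hypothetical illustrations.
    """
    n = len(costs)
    n_static = int(n * static_fraction)
    assignment = [[] for _ in range(num_threads)]
    loads = [0.0] * num_threads

    # Static phase: contiguous blocks fixed before execution,
    # so no runtime scheduling decisions are needed.
    block = -(-n_static // num_threads)  # ceiling division
    for t in range(num_threads):
        for i in range(t * block, min((t + 1) * block, n_static)):
            assignment[t].append(i)
            loads[t] += costs[i]

    # Dynamic phase: the thread that becomes idle first (lowest load so far)
    # takes the next chunk of the remaining iterations.
    dispatches = 0
    for start in range(n_static, n, chunk):
        t = min(range(num_threads), key=loads.__getitem__)
        for i in range(start, min(start + chunk, n)):
            assignment[t].append(i)
            loads[t] += costs[i]
        dispatches += 1

    return assignment, loads, dispatches


if __name__ == "__main__":
    # Iteration costs grow with the index, so a purely static split is unbalanced.
    costs = list(range(1, 17))
    for fraction in (1.0, 0.5, 0.0):
        _, loads, dispatches = hybrid_schedule(costs, 4, fraction, chunk=2)
        print(f"static_fraction={fraction}: max load={max(loads)}, dispatches={dispatches}")
```

For this particular input, the half-static split reaches the same maximum per-thread load as the fully dynamic run (46, against 58 for the fully static split) while performing 4 dynamic hand-outs instead of 8. Real workloads will behave differently, and the best split depends on how iteration costs are distributed.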

Keywords/Search Tags:

master-slave hybrid architecture processor, compilation optimization, thread level parallelism, multilayer nested loop, code alignment, scheduling strategy


Related items

1	The Research And Implementation Of The Key Techniques On Single Chip Multiprocessors
2	Research On Parallel Model And Compiler Optimization Technique Based On Multi-core
3	Loop Realization And Optimization Based On X Stream Processor
4	Research On BGP Parallelism Technologies For Multicore And Multi-threading
5	The Research And Implementation Of Key Techniques On Block Cipher ASIP
6	Parallel Algorithm Design And Optimization For H.264 Video Encoding
7	Research On The Teleoperation Motion Control Strategy For A Master-slave Minimally Invasive Surgical Robot
8	Research On Memory-level Parallelism For Multi-core Microprocessor Chip
9	SEMM: A Scheduler For Embedded Master-slave Multi-core Microprocessors
10	Analysis And Optimization Of SIMT Thread Scheduling Model