Research On Adaptation Method Of Compute-Intensive Algorithms For Domestic Processors

Posted on:2024-04-16

Degree:Master

Type:Thesis

Country:China

Candidate:T Wang

Full Text:PDF

GTID:2568307160959229

Subject:Information and Communication Engineering

Abstract/Summary:


With the continuous development of high-complexity algorithms such as hashing, high-security encryption, and deep learning, the demand for processor computing power has reached the order of one trillion floating-point operations per second. For domestic high-performance processors, the compatibility between algorithms and processor architectures must be resolved, and software algorithms must be designed to use the processor hardware more effectively. This thesis focuses on adapting compute-intensive algorithms to domestic high-performance processors, covering: 1) an efficient implementation of hash encryption operators for domestic many-core processors; 2) a heuristic method for instruction reordering optimization; and 3) an adaptation scheme for deep learning convolution operators on a domestic artificial intelligence processor.

First, we design and implement a hash core operator library for the Sunway many-core processor. Based on three dimensions of parallelism (thread level, data level, and instruction level), we propose a subtask allocation scheme for hash password recovery and a multi-thread optimization scheme. We also design an assembly implementation method for the algorithm and an instruction sequencing scheme, and finally achieve parallel computation of 512 passwords. Compared with the C implementation, the assembly-optimized core operators run 44.5%–76.7% faster. Compared with the assembly implementation, data-parallel optimization reduces the computing time of the core operators by 84.3%–85.3%. The dictionary password recovery algorithm built on this parallel optimization scheme is 9.66 to 14.01 times faster than Hashcat running on an AMD Radeon Pro WX 3200 Series GPU.

Second, we propose a heuristic method for reordering assembly instructions to suit the microarchitecture of multi-issue processors. The optimization objective is to minimize the execution time of an instruction fragment on the target processor, and the heuristic determines the issue priority of instructions during reordering. By simulating the hardware execution behavior of the code fragment, the method accounts for the influence of both the code itself and the processor hardware on execution time. Comparative experiments verify that the proposed reordering method effectively mitigates the performance degradation caused by program migration. Compared with GCC's "-O3" optimization level, programs reordered with the proposed heuristic run 30.0% faster; compared with the list scheduling algorithm, their time performance improves by 8.8%–13.9%.

Finally, we propose an efficient adaptation scheme for convolution operators on a domestic dedicated artificial intelligence processor. To address the processor's limited local memory, we explore how multi-thread parallel design, data mapping design, and memory access optimization affect convolution performance, and analyze the performance gains brought by the processor's dedicated components. A double-buffer control mechanism is designed to hide memory access latency, reducing total computing time by 55.4% compared with the original optimization. A data storage and reading scheme is also designed for the data format conversion required by matrix multiplication; compared with an independent format conversion method, the proposed direct data mapping method makes convolution 2.0 to 4.7 times faster. Lastly, we discuss how the dimension limit of the systolic array affects the data mapping mode and computing speed, and how the address boundary limit on data transfers affects input data dimension requirements and transfer speed, and we propose corresponding solutions.
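The password recovery subtask allocation can be pictured as splitting a dictionary across worker threads and then feeding each thread fixed-width data-parallel batches. The sketch below is a minimal illustration under stated assumptions, not the thesis's scheme: contiguous block partitioning, a hypothetical default of 64 workers, a batch width of 512 echoing the abstract's figure, and SHA-256 as a stand-in hash (the abstract does not name the hash algorithm). The functions `split_ranges`, `batches`, `worker_search`, and `dictionary_recover` are hypothetical names.

```python
import hashlib
from typing import Iterator, List, Optional, Tuple


def split_ranges(n_items: int, n_workers: int) -> List[Tuple[int, int]]:
    """Partition [0, n_items) into n_workers contiguous ranges whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # first `extra` workers take one more item
        ranges.append((start, start + size))
        start += size
    return ranges


def batches(start: int, end: int, width: int = 512) -> Iterator[Tuple[int, int]]:
    """Yield sub-ranges of at most `width` candidates; each stands for one data-parallel batch."""
    for lo in range(start, end, width):
        yield lo, min(lo + width, end)


def worker_search(words: List[str], start: int, end: int, target_hex: str) -> Optional[str]:
    """Hash every candidate in [start, end) batch by batch and return the first match, if any."""
    for lo, hi in batches(start, end):
        for word in words[lo:hi]:
            if hashlib.sha256(word.encode()).hexdigest() == target_hex:  # SHA-256 is a stand-in
                return word
    return None


def dictionary_recover(words: List[str], target_hex: str, n_workers: int = 64) -> Optional[str]:
    """Run each worker's range in turn; on real hardware each range would go to its own thread."""
    for start, end in split_ranges(len(words), n_workers):
        hit = worker_search(words, start, end, target_hex)
        if hit is not None:
            return hit
    return None
```

Contiguous ranges keep each worker's reads sequential, which tends to suit small local memories; a real implementation would also have to decide how early termination is signalled across threads.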
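The abstract benchmarks its reordering heuristic against list scheduling. For reference, the sketch below shows that classic baseline with a common critical-path priority on an idealized multi-issue machine. It is not the thesis's heuristic, whose details are not given here. The machine model is an assumption: fixed per-instruction latencies, at most `width` issues per cycle, and no other structural hazards. `critical_path` and `list_schedule` are hypothetical names.

```python
from typing import Dict, List, Tuple


def critical_path(lat: Dict[str, int], succs: Dict[str, List[str]]) -> Dict[str, int]:
    """Longest latency-weighted path from each instruction to the end of the dependency DAG."""
    memo: Dict[str, int] = {}

    def cp(n: str) -> int:
        if n not in memo:
            memo[n] = lat[n] + max((cp(s) for s in succs[n]), default=0)
        return memo[n]

    for n in lat:
        cp(n)
    return memo


def list_schedule(lat: Dict[str, int], deps: Dict[str, List[str]],
                  width: int = 2) -> Tuple[List[str], int]:
    """Greedy cycle-by-cycle list scheduling; deps[n] lists the instructions whose results n reads.
    Returns the issue order and the cycle at which the last result becomes available."""
    succs: Dict[str, List[str]] = {n: [] for n in lat}
    for n, preds in deps.items():
        for p in preds:
            succs[p].append(n)
    prio = critical_path(lat, succs)
    ready_at: Dict[str, int] = {}  # cycle at which each issued instruction's result is available
    order: List[str] = []
    cycle = 0
    while len(order) < len(lat):
        ready = [n for n in lat if n not in ready_at
                 and all(p in ready_at and ready_at[p] <= cycle for p in deps.get(n, []))]
        ready.sort(key=lambda n: (-prio[n], n))  # longest critical path first, name breaks ties
        for n in ready[:width]:
            ready_at[n] = cycle + lat[n]
            order.append(n)
        cycle += 1
    return order, max(ready_at.values())
```

With `lat = {"a": 3, "b": 1, "c": 1, "d": 1}`, `deps = {"d": ["a"]}` and `width=1`, the long-latency `a` is issued first so that `b` and `c` fill its latency, giving a makespan of 4 cycles; issuing `b` and `c` before `a` would stretch it to 6.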
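Double buffering hides memory access latency by loading tile i+1 into one buffer while tile i is computed from the other. The idealized timing model below illustrates why it helps. It assumes that the DMA engine and the compute unit run fully concurrently, that exactly two on-chip buffers are available, and that per-tile load and compute times are known. It is not the thesis's control mechanism, and it does not reproduce the thesis's measured 55.4% reduction. The function names are hypothetical.

```python
from typing import List


def single_buffer_time(load: List[float], compute: List[float]) -> float:
    """One buffer: every tile is loaded, then computed, with no overlap."""
    return sum(l + c for l, c in zip(load, compute))


def double_buffer_time(load: List[float], compute: List[float]) -> float:
    """Two buffers: the load of tile i may overlap the compute of tile i-1."""
    n = len(load)
    load_end = [0.0] * n
    comp_end = [0.0] * n
    for i in range(n):
        buf_free = comp_end[i - 2] if i >= 2 else 0.0   # buffer i % 2 was last used by tile i-2
        dma_free = load_end[i - 1] if i >= 1 else 0.0   # a single DMA engine loads one tile at a time
        load_end[i] = max(dma_free, buf_free) + load[i]
        prev_comp = comp_end[i - 1] if i >= 1 else 0.0  # compute unit processes tiles in order
        comp_end[i] = max(load_end[i], prev_comp) + compute[i]
    return comp_end[-1] if n else 0.0
```

For four tiles with load time 2 and compute time 3, the single-buffer total is 20 while the double-buffered total is 14, i.e. one exposed load plus four computes. When loads dominate, the total instead approaches n loads plus one compute, so overlap can only hide the shorter of the two phases.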
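Running convolution on a matrix-multiply engine requires converting the input into a matrix layout, and the engine's fixed size caps the block dimensions it can take at once. The sketch below illustrates both points with a standard, separate conversion pass (im2col) followed by a matrix product computed in blocks no larger than a hypothetical `tile_m x tile_k x tile_n` array. It is a generic stand-in, not the thesis's direct data mapping method, and it is not claimed to be the exact "independent format conversion method" the thesis compares against. It assumes stride 1, no padding, and nested Python lists. All names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def im2col(x: List[Matrix], kh: int, kw: int) -> Matrix:
    """x is C x H x W; returns a (C*kh*kw) x (OH*OW) matrix, one row per (channel, ki, kj)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    oh, ow = h - kh + 1, w - kw + 1
    return [[x[ci][r + i][s + j] for r in range(oh) for s in range(ow)]
            for ci in range(c) for i in range(kh) for j in range(kw)]


def tiled_matmul(a: Matrix, b: Matrix, tile_m: int, tile_n: int, tile_k: int) -> Matrix:
    """A @ B computed block by block, each block bounded as if by a fixed-size systolic array."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            for p0 in range(0, k, tile_k):  # partial sums over K are accumulated into `out`
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        out[i][j] += sum(a[i][p] * b[p][j] for p in range(p0, min(p0 + tile_k, k)))
    return out


def conv2d_gemm(x: List[Matrix], weights: List[List[Matrix]],
                tile: Tuple[int, int, int] = (4, 4, 4)) -> List[Matrix]:
    """weights is F x C x kh x kw; returns F x OH x OW."""
    c, kh, kw = len(weights[0]), len(weights[0][0]), len(weights[0][0][0])
    ow = len(x[0][0]) - kw + 1
    # Flatten each filter in the same (channel, ki, kj) order that im2col uses for its rows.
    wmat = [[f[ci][i][j] for ci in range(c) for i in range(kh) for j in range(kw)] for f in weights]
    y = tiled_matmul(wmat, im2col(x, kh, kw), *tile)
    return [[row[r * ow:(r + 1) * ow] for r in range(len(row) // ow)] for row in y]
```

For a 3x3 input `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` and a single 2x2 filter `[[1, 0], [0, 1]]`, the result is `[[[6.0, 8.0], [12.0, 14.0]]]` for any tile sizes. The explicit im2col matrix duplicates input elements up to kh*kw times, which is the kind of extra storage and traffic that a direct mapping scheme tries to avoid on a processor with limited local memory.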

Keywords/Search Tags:

domestic processor, hash encryption algorithm, instruction rearrangement, convolution operator


Related items

1	Design And Implementation Of Vector Math Library Based On Domestic Processor
2	Instruction Set Design For Encryption Application Specific Instruction Set Processor
3	The Design And Implement Of Instruction Decode&Control Unit In FT-C55LP
4	Design And Implementation Of Mobile Security Access System Based On Domestic Commercial Encryption Algorithm
5	The Research Of Encryption And Decryption Algorithms In The Embedded Trusted Computing Platform
6	Research On The Key Techniques Of Application-Specific Instruction-Set Processors
7	Design And Implementation Of Secure Hash Algorithm Based On The GPU
8	Study On Application Specific Instruction Set Processor For Video Coding And Its VLSI Implementations
9	Research On Event-based Convolution Algorithm And Design Of Event-based Convolution Processor
10	Research On Fast Retrieval Method Of Image Data Based On Learning To Hash