| The Floyd shortest path algorithm,a classic algorithm in the study of graph theory and algorithm design,has a very wide range of applications in many fields such as path planning and game development.Many real systems can be represented by large complex networks,and computing the shortest path between vertices is key to studying the characteristics of these networks.As the world’s first domestic supercomputer with a peak performance of 1.25 billion times,“ Sunway Taihu Light ” has been ranked first for many times,and is equipped with China’s self-developed Sunway many-core processor SW26010.Many software and algorithms have been successfully ported and optimized,and a very impressive acceleration ratio has been obtained.In recent years,with the development of many-core processors and high performance computing,the parallel implementation and high performance optimization of Floyd algorithms on many-core processors has remained a hot topic of research.At present,the related research of Floyd algorithm in Sunway platform is still blank.The Floyd algorithm has high time complexity and,but has regular access and computation rules.How to exploit its computational parallelism and storage locality according to the multi-level parallel computational resources and hierarchical storage resource of the multicore processor is the key to improve performance.However,the problem of extracting large amounts of parallelism from Floyd’s algorithm in the presence of output dependencies and read/write dependencies is not straightforward.How to design a correct and effective parallel algorithm,how to deeply optimize the parallel algorithm according to the hardware characteristics of the multicore processor and,in particular,how to partition the data in the presence of limited local storage space are all urgent issues to be solved.In order to find a more efficient shortest path algorithm on the Sunway platform,this thesis carries out research related to the design,implementation and deep optimization of Floyd’s parallel algorithm based on the Sunway platform.The main work and contributions in this thesis are as follows:·Multi-level task partitioning and deployment of Floyd parallel algorithm based on SW26010 processor.Based on the master-slave core architecture of SW26010 processor,firstly,process-level task partitioning is designed by dividing the data into two dimensions according to the number of processes opened and deploying the divided data to the main memory of the master core.Next,a thread-level task partition is designed,again in two dimensions,and the partitioned data is deployed to the local memory of each slave core.Since the local memory size of the slave cores is limited and not enough to put down all the computational data,we divide the matrix of each slave core at the data level and read the data one row at a time.The multi-level partitioning of tasks makes full use of the computational resources of the SW26010processor’s slave cores.·Implementation of Floyd parallel algorithm based on SW26010 processor.Based on the characteristics of SW26010 processor master-slave core architecture,masterslave accelerated parallel programming is used for implementation.The parallel algorithm was experimentally tested to achieve a performance improvement of up to67.35 times compared to the serial algorithm,and further analysis revealed that the restricted access bandwidth from the cores is a key factor affecting the performance of the algorithm after parallelism.·Deep optimization of Floyd parallel algorithm based on SW26010 processor.Based on the fact that “access memory of slave cores is the bottleneck that limits the performance of Floyd algorithm”,we reduce the access memory of slave cores through algorithm-level optimization,array partitioning and double buffering optimization techniques based on the access memory characteristics of SW26010 processor.Test results show that the SW26010 processor is able to improve program performance by reducing the access overhead of the slave cores through algorithm-level optimization,array partitioning and double buffering.Test results show that the Floyd parallel algorithm based on the Sunway platform can achieve a maximum speedup of 106 times compared to the serial algorithm.Further,the key factors affecting the performance of the algorithm after deep optimization were analyzed by using different numbers of slave cores. |