| With the advent of Industry 4.0 centered on information fusion,centralized control is gradually replaced by distributed.This paradigm shift makes the system more scalable,but it is accompanied by an increase in the system’s complexity and the increasing difficulty of system modeling,analysis,and control optimization problems.Petri nets,which can describe parallelism,conflict,and causality in discrete event systems,have become one of the commonly used modeling tools for distributed concurrent systems since their birth.The reachability graph composed of all the reachable states of Petri net can display all the dynamic information of the Petri net,which is an essential basis for implementing many analysis and control methods.However,the change of the network structure and the increase of the initial identification will increase the number of reachable states and even cause the problem of state space explosion.Once the ”explosion” occurs,the calculation of the state space of Petri nets cannot be achieved by traditional serial algorithms.With the help of the general parallel computing architecture CUDA and related data structures,this thesis improves the algorithm by improving program parallelism and compressing memory storage.The main work is as follows:1.The existing serial algorithm is optimized,and the hash table structure is introduced to store the state,so that the performance of the algorithm is greadly improved.Then the algorithm is analyzed and summarized,and the minimum subtask in the solution process is obtained.And on this basis,with the help of the CPU-GPU heterogeneous architecture,the thread organization and memory allocation are analyzed and processed,and the CUDAbased Petri net reachable state parallel algorithm is obtained.In terms of data structure,the parallel algorithm adopts the composite structure of linked list and red-black tree to replace the linked list structure in the serial algorithm,which can store more elements and improve the efficiency of element insertion query.2.In the above work,only kernel-level concurrency is realized,and one thread can only execute one kernel function.To further deepen parallelism,CUDA streams are introduced to achieve grid-level concurrency.The reachable state parallel algorithm based on CUDA stream realizes the concurrent execution of multiple kernel functions,overlaps the execution of kernel functions and data transmission,and overlaps the calculation of CPU and GPU,which greatly taps the parallel potential of the CUDA architecture.3.Considering the high time complexity and space complexity of the linked list-red-black tree composite structure,the Bloom filter that occupies less memory and has a more straightforward structure is introduced.The Bloom filter composed of bit arrays converts state storage into simple binary operations,which significantly compresses the memory required for storage,and no longer involves conversion between data structures and rebalancing after adding elements.However,it lacks a locking mechanism,hence when multiple threads operate on an array index simultaneously,thread safety problems may occur.Finally,this thesis innovatively proposes a composite structure of the Bloom filter and hash table.Each hash bucket of the hash table stores a Bloom filter.The data structure inherits the lock mechanism of the hash table,which solves the problem of thread safety,and multiple bloom filters divide the state storage,which significantly reduces the false positive rate and solves the most significant defect of the bloom filter.Through the improvement from the shallow to the deep in the above three directions,the reachable state parallel algorithm based on CUDA stream and Bloom filter is finally extracted,making significant improvements in the degree of parallelism and memory compression.When calculating the reachable states of complex and straightforward Petri nets,the traditional serial algorithm often takes more than an hour to obtain the result,and the order of magnitude of the calculation can only reach the level of 30 to 40 million.The extracted parallel algorithm only needs one to two minutes to easily break through the computational bottleneck of the serial algorithm,and the acceleration can reach more than 60 times.With the increase of the number of reachable states,the acceleration is also increasing.The number of reachable states calculated by the parallel algorithm also increases from tens of millions to the level of 100 million.In addition,in the case of limited hardware conditions,compared with other classical parallel algorithms,the parallel algorithm in this paper is more stable,has a broader scope of application and higher acceleration efficiency. |