
The Research On Parallel Overlay Analysis Method Of Massive Complex Polygons Based On Spark

Posted on: 2021-11-22  Degree: Master  Type: Thesis
Country: China  Candidate: C Wang  Full Text: PDF
GTID: 2480306500475554  Subject: Photogrammetry and Remote Sensing
Abstract/Summary:
With the substantial growth in the scale of geospatial data and the deepening and refinement of GIS applications, work and research often require massive geospatial datasets to be processed and analyzed quickly, which is beyond the scope of traditional GIS processing modes. As one of the basic algorithms in geospatial analysis, overlay analysis is widely used in GIS and is a typical computation-intensive and data-intensive algorithm. To address the low processing efficiency of overlay analysis on massive, complex spatial data, existing research has made many attempts in multi-core and shared-memory environments and has improved computational efficiency. With the rapid development of computer and communication technology and the emergence of new computing architectures, however, traditional methods still leave considerable room for improvement in scalability, computational complexity, and computational efficiency. Big data technology, a new technology in the Internet field, addresses the efficient storage and rapid analysis of massive data through new storage and computing architectures, but its application in the GIS field is currently still limited. Spark, a new parallel computing architecture based on in-memory computing, shows great advantages in the rapid processing and analysis of big data.

To address the low computational efficiency of overlay analysis on large volumes of complex polygon data, this paper proposes a Spark-based overlay analysis method for massive complex polygons. It introduces Spark technology from the big data field into geospatial analysis and partitions the data so that the actual computational load of the polygon data is balanced. Experiments show that, compared with traditional processing methods, the parallel computing architecture adopted in this paper improves the efficiency of the overlay analysis algorithm while ensuring correct results. The main research content and results of the paper are as follows:

(1) By summarizing the theory behind the core components of the open-source Hadoop ecosystem, the distributed in-memory computing architecture Spark, polygon overlay analysis, and parallel polygon overlay analysis algorithms, the paper provides a theoretical basis for the subsequent research. After analyzing the existing parallel computing modes, a data-parallel approach to parallel overlay analysis is adopted.

(2) Combining the structural characteristics of vector polygon data with the principles of the parallel overlay analysis algorithm, a storage model for vector polygons in a distributed environment is designed. The model takes into account both the information integrity of vector polygon data in a distributed storage environment and the computational efficiency of later processing stages. To address the low parallel efficiency caused by data skew between nodes, a data partitioning method based on the Hilbert curve and on balancing the actual computational load of the polygon data is proposed. Because it balances the computation assigned to each node, this strategy effectively improves the efficiency of the parallel overlay analysis algorithm. To improve efficiency further, a hybrid spatial index is designed, which speeds up the filtering stage of the overlay analysis. In addition, because vector polygons may span multiple grid cells during data partitioning owing to their structural complexity, size differences, and uneven distribution, a strategy for handling cross-boundary polygons is proposed. This strategy effectively reduces the repeated computation of polygons that span multiple grid cells during the calculation phase.

(3) Based on the research above, a parallel overlay analysis method built on the in-memory computing architecture Spark is proposed, and a prototype overlay analysis system is implemented on the open-source Hadoop ecosystem. Finally, the rationality and effectiveness of the proposed strategies are verified through experiments.

The experimental environment is a Spark cluster of six high-performance Dell servers. Each server has an E5-2620 CPU (12 cores, 2.4 GHz), 128 GB of memory, and a 9 TB disk. To verify the correctness and effectiveness of the proposed parallel overlay strategy, four sets of experiments are designed: the performance of the overlay analysis algorithm under different data volumes, the impact of different parallel granularities on its performance, the impact of different grid granularities on its performance, and the impact of different data partitioning methods on its performance.

The experiments show that the results computed by the parallel overlay analysis algorithm in the cluster environment are consistent with those of ArcMap, which demonstrates the correctness of the parallel overlay analysis results. In addition, the six-node parallel overlay analysis strategy outperforms both the single-node and the ArcMap computing strategies at every data volume tested, and this advantage grows as the data volume increases. At a data volume of 10 million, the calculation time is 4380 seconds on six nodes, 9930 seconds in ArcMap, and 10164 seconds on a single computing node. Compared with a partitioning strategy that balances data volume, the proposed strategy that balances computational load shows an advantage that grows with the data volume, reaching 1272 seconds. This is mainly because balancing data volume does not reflect the actual computational load of the data, whereas partitioning by balanced computational load does. The effect is most pronounced for large datasets whose elements differ greatly in complexity. These experiments show that the proposed parallel overlay analysis method for massive complex polygons, based on an in-memory computing architecture, adapts well to the overlay analysis of large-scale complex polygons.
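
The partitioning idea in (2), ordering polygons along a Hilbert curve and then cutting that order into partitions of roughly equal computational load, can be illustrated with a minimal Python sketch. This is not the thesis implementation: the cost proxy (vertex count), the use of the vertex-average point to pick a grid cell, and the names `hilbert_index`, `estimated_cost`, `cell_of`, `split_balanced`, and `partition_polygons` are all assumptions made for illustration, and the abstract does not say how the actual load is estimated.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Polygon = Sequence[Point]  # outer ring only; a simplification for this sketch


def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve on a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip so the sub-curve in this quadrant is oriented correctly
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def estimated_cost(polygon: Polygon) -> int:
    """Hypothetical load proxy: the number of vertices in the polygon."""
    return len(polygon)


def cell_of(polygon: Polygon, extent: Tuple[float, float, float, float], order: int) -> Tuple[int, int]:
    """Grid cell containing the vertex-average point of the polygon (an assumption)."""
    min_x, min_y, max_x, max_y = extent
    n = 1 << order
    cx = sum(p[0] for p in polygon) / len(polygon)
    cy = sum(p[1] for p in polygon) / len(polygon)
    gx = min(n - 1, max(0, int((cx - min_x) / (max_x - min_x) * n)))
    gy = min(n - 1, max(0, int((cy - min_y) / (max_y - min_y) * n)))
    return gx, gy


def split_balanced(costs: Sequence[float], k: int) -> List[List[int]]:
    """Cut positions 0..len(costs)-1 into k contiguous runs of roughly equal total cost."""
    total = sum(costs)
    parts: List[List[int]] = [[] for _ in range(k)]
    running, p = 0.0, 0
    for i, cost in enumerate(costs):
        # Advance once the partitions filled so far hold their share of the total cost.
        while p < k - 1 and running >= total * (p + 1) / k:
            p += 1
        parts[p].append(i)
        running += cost
    return parts


def partition_polygons(polygons: Sequence[Polygon], extent, k: int, order: int = 8) -> List[List[Polygon]]:
    """Sort polygons by Hilbert index, then split the sorted list by estimated cost."""
    keyed = sorted(polygons, key=lambda poly: hilbert_index(order, *cell_of(poly, extent, order)))
    runs = split_balanced([estimated_cost(p) for p in keyed], k)
    return [[keyed[i] for i in run] for run in runs]
```

With costs `[40, 10, 10, 10, 10, 10, 10, 20]` and two partitions, `split_balanced` assigns the first three items to one partition and the remaining five to the other, so each carries a cost of 60. Splitting the same list by item count would give 70 and 50, which is the kind of skew the load-balanced strategy is meant to avoid. In a Spark job the resulting groups would typically be mapped to partition IDs through a custom partitioner; that step is omitted here.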
Keywords/Search Tags:memory computing architecture, massive complex polygons, overlapping intersections, parallel compute