Distributed Stream Processing Of Big Spatial Data

Posted on: 2022-12-06  Degree: Master  Type: Thesis
Country: China  Candidate: L B Yu
GTID: 2480306767966049  Subject: Computer Software and Application of Computer
Abstract/Summary:
A growing number of geographic-information industries have built real-time data acquisition and transmission links, and massive spatial data streams now flow continuously into information systems. Many real-time applications need to process and analyze such data immediately in order to provide timely solutions to complex spatio-temporal problems. Traditional batch processing engines can, with suitable extensions, process big spatial data in a distributed manner. However, because of limitations in the underlying execution engine, they can only process bounded historical spatial data at high throughput. Low-latency stream processing of big spatial data therefore requires a distributed stream processing engine. To this end, after identifying the common stream processing problems behind many real-time application scenarios, this paper gives formal definitions of streaming query and streaming clustering over big spatial data, and designs algorithms suited to a distributed stream processing engine.

For streaming query processing, this paper designs a two-tier query processing framework. First, a global grid partitions the data so that multiple partitions can be processed in parallel; then a local in-memory R-tree index within each partition provides efficient spatial retrieval. In addition, to join two spatial data streams correctly across partition boundaries, this paper designs a partition boundary data redundancy algorithm: data in one of the streams that falls within a certain distance of a partition boundary is redundantly routed to the adjacent partitions.

For streaming clustering, this paper proposes a two-stage streaming DBSCAN algorithm based on pairwise distance join. In the first stage, a parallel pairwise distance join over a single stream of spatial points is performed with the help of the two-tier query processing framework; that is, it finds all pairs of points in the stream whose distance is within a given threshold. To perform the cross-partition join correctly while avoiding duplicate results, a partition boundary redundancy strategy that minimizes replication of spatial point data is further designed. In the second stage, the DBSCAN algorithm with (9)) complexity is realized on the basis of the pairwise distance join results.

To verify the effectiveness of the streaming query and clustering defined in this paper in real scenarios, and the feasibility of the designed algorithms, this paper implements Glink, a prototype spatial data stream processing system, and tests its performance on real data sets. Glink adds a spatial data stream layer and a spatial data stream processing layer on top of Apache Flink, and provides simple interfaces for streaming query and clustering. According to the performance tests on real data, Glink achieves a throughput of about 100,000 per second with a parallelism of one in streaming queries and scales out well; in streaming DBSCAN, Glink achieves a throughput of more than 10,000 per second.
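
The two-stage idea described in the abstract (grid partitioning with boundary replication, a pairwise distance join, then DBSCAN built from the joined pairs) can be illustrated with a small single-process sketch. This is not Glink's code and does not use Apache Flink. All function names are hypothetical, and the sketch makes several simplifying assumptions: it works on a finite batch of points rather than a windowed stream, it replaces the local R-tree with a brute-force scan inside each partition, it replicates a point to every neighbouring cell within the threshold instead of applying the thesis's minimized redundancy strategy, it requires the threshold `eps` to be no larger than the cell size, and it avoids duplicate pairs with an assumed rule (a pair is emitted only in the home cell of the point with the smaller id).

```python
import math
from collections import defaultdict
from itertools import combinations


def home_cell(p, cell):
    """Grid cell that owns point p (each point has exactly one home cell)."""
    return (math.floor(p[0] / cell), math.floor(p[1] / cell))


def target_cells(p, cell, eps):
    """Home cell plus every neighbouring cell lying within eps of p.

    Assumes eps <= cell, so only the 3x3 neighbourhood must be checked.
    """
    cx, cy = home_cell(p, cell)
    cells = set()
    for nx in (cx - 1, cx, cx + 1):
        for ny in (cy - 1, cy, cy + 1):
            # Distance from p to the rectangle covered by cell (nx, ny).
            dx = max(nx * cell - p[0], 0.0, p[0] - (nx + 1) * cell)
            dy = max(ny * cell - p[1], 0.0, p[1] - (ny + 1) * cell)
            if math.hypot(dx, dy) <= eps:
                cells.add((nx, ny))
    return cells


def distance_join(points, cell, eps):
    """Stage 1: all pairs (i, j), i < j, with distance <= eps, each once."""
    partitions = defaultdict(list)
    for pid, p in enumerate(points):
        for c in target_cells(p, cell, eps):
            partitions[c].append(pid)  # home copy or boundary replica
    pairs = []
    for c, ids in partitions.items():
        for i, j in combinations(sorted(ids), 2):
            # Assumed de-duplication rule: only the home cell of the
            # smaller id reports the pair.
            if home_cell(points[i], cell) != c:
                continue
            if math.dist(points[i], points[j]) <= eps:
                pairs.append((i, j))
    return pairs


def dbscan_from_pairs(n, pairs, min_pts):
    """Stage 2: DBSCAN labels from joined pairs; -1 marks noise."""
    neighbours = defaultdict(set)
    for i, j in pairs:
        neighbours[i].add(j)
        neighbours[j].add(i)
    # A point counts itself when testing the core condition.
    core = {i for i in range(n) if len(neighbours[i]) + 1 >= min_pts}
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        if i in core and j in core:
            parent[find(i)] = find(j)  # merge density-connected cores
    labels = [-1] * n
    for i in range(n):
        if i in core:
            labels[i] = find(i)
        else:
            # Border point: adopt the cluster of any neighbouring core.
            for j in neighbours[i]:
                if j in core:
                    labels[i] = find(j)
                    break
    return labels


if __name__ == "__main__":
    pts = [(0.9, 0.5), (1.1, 0.5), (1.0, 0.6), (9.0, 9.0)]
    found = distance_join(pts, cell=1.0, eps=0.3)
    print(sorted(found))
    print(dbscan_from_pairs(len(pts), found, min_pts=3))
```

In the example at the bottom, the first three points straddle the grid line at x = 1, so the join only finds all of their pairs because boundary points are replicated into the adjacent cell. In a distributed setting, each partition's inner loop would run on a separate worker.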
Keywords/Search Tags: Spatial Big Data, Distributed Streaming Processing, Streaming Query, Streaming Cluster