Research On Massive Geospatial Data Processing In Cloud Computing Environment

Posted on: 2018-11-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Huang
Full Text: PDF
GTID: 1360330515489800
Subject: Photogrammetry and Remote Sensing
Abstract/Summary:
With the era of the Internet of Things (IoT) arriving on the back of rapidly developing sensor technologies, fast processing and analysis of geospatial data have become the primary concern of applications such as the smart city and the digital Earth. High spatial-temporal-resolution remote sensing (RS) data, huge in volume and variety, is the key for humans to explore global patterns in the atmospheric, soil, and water cycles of the Earth. Combined with explosively generated geo-tagged data, it paints a big picture for understanding the relationship between the environment and human activities. To satisfy the enormous computing-resource requirements of these processing and analysis applications, more and more research turns to high-performance computing combined with virtualization technologies for the parallel processing of massive geospatial data. Both runtime-scalable parallel processing algorithms and an effective, generic parallel model are a must when handling "big" geospatial data. Although MapReduce has a great advantage in delivering satisfactory computing performance on off-the-shelf PC clusters, the model does not support iterative computing, which is common in machine learning algorithms. Moreover, its "data locality" computing introduces the danger of processing skew, which concentrates computational tasks on overloaded computing nodes of lower performance. Additionally, the performance of virtual machines is strongly affected by the virtualization technologies used in the cloud computing environment. Obtaining analysis results from massive geospatial data in the cloud in real time will remain a challenge in the coming years. Another advanced computing model named Spark, with lower fault-recovery cost and lower latency, has emerged, cultivating a growing body of research on leveraging it to process big geospatial data. This dissertation aims to address several key issues, including the parallel computing model, algorithm implementations, and underlying framework optimization, in leveraging Spark for the real-time processing of massive geospatial data in a cloud computing environment.

Although Spark has succeeded in becoming a unified big-data processing engine, progress on the integrated processing and analysis of massive geospatial data remains limited. The coarse-grained data-parallel model requires data-partition policies that consider the heterogeneity of the underlying computing resources. Spark-based geospatial parallel algorithms must be self-adaptive to changes in the computing resources. The scheduler and intermediate-data management of the Spark core consider neither the heterogeneity nor the load balance of computing nodes, and Spark does not support spatial operations either. Smart computing-resource provisioning requires significant learning effort on the Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) layers of the cloud computing environment. This dissertation proposes several solutions to overcome these issues. Its major contributions are as follows.

First, to fill a gap in the integrated processing and analysis of massive RS data, we propose a strip-oriented parallel computing model that incorporates a strip abstraction into resilient distributed datasets (RDDs). Complex parallel RS algorithms can be easily implemented using transformation primitives and BitTorrent-enabled broadcast variables. To efficiently process huge volumes of RS data in Hadoop federated clusters, we propose a generic RS image partition method that minimizes the cost of transferring data blocks between nodes. We also evaluate the efficiency of the algorithms on two different clusters, classified as passive and active resource management respectively. Experiments demonstrate the strong adaptiveness of the model irrespective of the differences and complexity of parallel RS algorithms.

Second, to address the bottleneck of the Spark engine and the runtime workload imbalance of Spark-based RS algorithms, which leads to large latency in parallel computing tasks, we propose a Kademlia-based caching model for managing intermediate data, so that the Spark scheduler can implicitly learn the workloads of Spark executors on computing nodes and schedule more intelligently. According to the experimental results, the efficiency of join-intensive RS image processing algorithms improves by 20.1% to 26.3%, and that of iteration-intensive algorithms by 20.7% to 32.1%, compared with the same algorithms on naive Spark.

Third, for the timely analysis of massive spatial data in OpenStack clouds, we propose an elastic spatial query processing model. First, Spark components are assigned to separate self-healing Docker containers; then a container orchestration tool such as Kubernetes dynamically supplies elastic computing clusters according to the workloads of the algorithms and their input parameters; finally, an autoscaling group scales virtual machines in and out. Experiments indicate that elastic container provisioning can satisfy time-constrained spatial data analysis in OpenStack clouds, provided that a moderate number of containers and Spark executor cores are used.

Fourth, to accurately map soil-moisture patterns at the national scale, we propose a cloud-based deep learning architecture called ElasticSpark. By combining container-based virtualization with deep feed-forward frameworks, ElasticSpark facilitates integrated processing of, and learning from, massive remotely sensed data. A deep learning model was trained using 12 bands of Visible Infrared Imaging Radiometer Suite (VIIRS) raw data and in-situ soil-moisture data as input parameters. We found that a deep learning model with 500 neurons in each of 8 hidden layers shows significant potential for learning the complex non-linear relationship between in-situ soil moisture and the estimated soil moisture content, with correlation coefficient R2 = 0.9875 and mean squared error (MSE) = 0.00007 over China. According to experiments, the architecture shows a clear advantage over counterparts such as YARN clusters and H2O clusters in efficiently training deep neural networks in clouds.
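The strip-oriented partition idea above can be illustrated with a minimal sketch: split an RS image into row strips whose byte size approximates one HDFS block, so each strip maps onto a single block and inter-node block transfers are minimized. All names and the 128 MB block-size default below are illustrative assumptions, not the dissertation's actual implementation.

```python
# Hypothetical strip-partition sketch: choose row strips so that each strip's
# byte size fits within one HDFS block (128 MB assumed here).
def strip_boundaries(rows, cols, bands, bytes_per_pixel,
                     block_size=128 * 1024 * 1024):
    """Return (start_row, end_row) pairs, one per strip."""
    row_bytes = cols * bands * bytes_per_pixel        # size of one image row
    rows_per_strip = max(1, block_size // row_bytes)  # rows fitting in a block
    strips = []
    start = 0
    while start < rows:
        end = min(start + rows_per_strip, rows)
        strips.append((start, end))
        start = end
    return strips

# Example: a 4-band 10000 x 10000 image with 2-byte pixels
strips = strip_boundaries(rows=10000, cols=10000, bands=4, bytes_per_pixel=2)
print(len(strips), strips[0])  # 6 strips; the first covers rows 0-1677
```

In a real Spark job, each `(start_row, end_row)` pair would become one RDD partition, with per-strip metadata distributed via broadcast variables.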
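The Kademlia-based caching model locates intermediate data through the DHT's XOR metric: executors and cached partition keys are hashed into the same ID space, and a partition lives on the executor whose ID is XOR-closest to the key. The sketch below shows only this lookup rule; the ID scheme and naming are illustrative assumptions, not the dissertation's implementation.

```python
# Kademlia-style lookup sketch for cached intermediate data: the owner of a
# cached partition is the executor whose ID has the smallest XOR distance
# to the partition key's hash (Kademlia's 160-bit SHA-1 ID space).
import hashlib

def node_id(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def xor_distance(a: int, b: int) -> int:
    return a ^ b

def closest_executor(partition_key: str, executors):
    key = node_id(partition_key)
    return min(executors, key=lambda e: xor_distance(node_id(e), key))

executors = ["executor-1", "executor-2", "executor-3", "executor-4"]
owner = closest_executor("rdd_42_partition_7", executors)
print(owner)
```

Because the mapping is deterministic, any node can compute where an intermediate block is cached without querying a central directory, which is what lets the scheduler infer executor workloads implicitly.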
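The elastic provisioning step can be reduced to a scale-in/scale-out decision of the kind an autoscaling group makes for Spark worker containers. The thresholds, step size, and function names below are illustrative assumptions rather than values from the dissertation.

```python
# Toy autoscaling decision: scale worker containers out when the task queue
# exceeds current capacity and in when it drains, bounded by a floor/ceiling.
# tasks_per_worker, min_workers, and max_workers are assumed values.
def desired_workers(pending_tasks, tasks_per_worker=8,
                    min_workers=2, max_workers=32):
    needed = -(-pending_tasks // tasks_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))

print(desired_workers(pending_tasks=100))  # queue needs 13 workers
print(desired_workers(pending_tasks=0))    # idle: scale in to the floor of 2
```

In Kubernetes terms, this is roughly what a Horizontal Pod Autoscaler does with a workload metric; the same target count would then drive the VM autoscaling group underneath.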
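The soil-moisture network described above (12 VIIRS band inputs, 8 hidden layers of 500 neurons, one output) can be sketched as a plain feed-forward pass. The random initialization, ReLU activations, and scale constant are illustrative assumptions; the abstract does not specify them.

```python
# Feed-forward sketch of the 12 -> 500 x 8 -> 1 architecture from the
# abstract. Weights are random placeholders, not trained parameters.
import random

random.seed(0)

def dense(weights, biases, x, activation):
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

relu = lambda v: max(0.0, v)
identity = lambda v: v

def build_layer(n_in, n_out, scale=0.05):
    w = [[random.uniform(-scale, scale) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out

sizes = [12] + [500] * 8 + [1]   # input, 8 hidden layers, output
layers = [build_layer(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

x = [random.random() for _ in range(12)]  # one pseudo VIIRS sample
for i, (w, b) in enumerate(layers):
    x = dense(w, b, x, relu if i < len(layers) - 1 else identity)

print(len(layers), x)  # 9 weight matrices; a single estimated value
```

Training such a model at scale is exactly where the container-backed ElasticSpark clusters described above come in; this sketch only fixes the layer geometry.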
Keywords/Search Tags: remote sensing parallel algorithms, spatial query processing, in-memory parallel processing framework, load balance, cloud computing environment