Font Size: a A A

Efficient partitioning approach to large scale distributed graph datasets

Posted on:2015-10-14Degree:Ph.DType:Dissertation
University:State University of New York at BinghamtonCandidate:Wang, RuiFull Text:PDF
GTID:1478390017993707Subject:Computer Science
Abstract/Summary:PDF Full Text Request
Big data problems challenge traditional techniques and tools. Current emerging techniques try to scale such problems within the capacity of commodity hardware with new solutions. One classic approach to maintain and process large datasets is distributed computing. However many traditional techniques do not take full advantage of the potential of distributed systems. Dataset partitioning is the first and most important step for distributed computing. However this step is still not well-addressed, and thus many applications generate poor data partitions. Hence most of the power of the distributed system is not used for computations, but rather for shuffling a large amount of data around the system.;In our proposal, we proposed to explore and develop new dataset partitioning approaches to distributed RDF triplestore. This approach should retain the linkage among data, and keep related data together in one partition. The size of the data partition should also be balanced. We proposed to explore and design more an efficient and optimized system architecture to leverage the advantages provided by graph data partitions. This new system architecture together with graph partitions should reduce the communication cost of the whole system. In further study of the applications of graph partitioning approaches in Big Data, we proposed to explore and develop online approaches for processing large scale graph datasets in a streaming way. The online approach should scan the data only once, and process the whole dataset in linear time complexity. Also this approach should use the available memory of the system wisely. We proposed to developed a scalable and efficient graph index structure for distributed graph datastores. The graph index should be easy and quick to retrieve. The size of the index should be easy to fit in the memory.;In this dissertation, first, we verified the effectiveness of a graph partitioning approach to the management of large scale datasets with an offline algorithm; second, we developed a more scalable, effective, and low complexity approach to meet the purpose of large scale dataset processing in industrial applications. In the first stage of our research, we proposed a promising approach that utilized the graph nature of RDF datasets to minimize relations between partitions for dataset partitioning. Based on applying uniform graph partitioning to the dataset, a local index was designed to filter intermediate results and optimize sub--querying, and a new system design was proposed to better exploit graph partitioned datasets. In the second stage, we explored a more scalable and less complex algorithm to process datasets at a larger scale. An online graph dataset partitioning was developed to produce high quality dataset partitions with fewer links between partitions.;As shown in our experiments, the graph partitioning approach is verified to effectively reduce the communication cost of query-processing messages compared to the hashing approach and enhance parallelism through independent sub-querying. The results also show that our online approach performed better than the hashing approach in the metrics of communication cost of query-processing messages, provided a comparable performance to an offline graph partitioning approach, and balanced the size of partitions compared with other approaches. This provides an effective solution for larger scale problems.
Keywords/Search Tags:Scale, Approach, Data, Graph, Large, Partitioning, Distributed, Partitions
PDF Full Text Request
Related items