Font Size: a A A

Research On Optimization Algorithms In Big Data Graphs Processing Systems

Posted on:2018-08-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:X D MengFull Text:PDF
GTID:1360330590955277Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
Graphs have been increasingly important in analyzing large real-world data.Social networks such as Facebook,Twitter,and LinkedIn now have billions of active users,and new connections among users are increasing day by day.A world-wide-web network might contain billions of web sites and trillions of links among them,where each of those web sites does not conform to any standard structure.No matter how complex the data are,they may be naturally represented as data graphs in which data are stored on edges and nodes are object identities to glue those edges.These big graphs are often stored in distributed systems,leading to many difficulties in processing those graphs with efficient algorithms.In recent years,many distributed graph processing models and platforms were proposed.MapReduce and Pregel have been shown to be scalable to deal with big data as well as large graphs.Nevertheless,to obtain scalability,these models offer restricted forms in which users specify their programs.Hence,it is non-trivial for users to write their complicated programs as well as to obtain efficient ones.On the other hand,a powerful way to query data from graphs intuitively is regular expression.It has been applied in many useful applications of queries,such as,finding relationships in social networks,or finding chains of reactions in biological networks.However,evaluating regular expression-based queries on distributed graphs is non-trivial.First,regular expressions imply a highly sequential evaluation.Second,distributed evaluations often produce intermediate graphs whose size is larger than the input graph,and require a large amount of communications.Further,efficient caching mechanism plays an important role in large graph query,in which the I/O performance is one of the primary considerations since vertices and edges may be used repeatedly.The data retrieving from the lowest storage center to the upper level cache system performs a distributed multilevel cache system.Many excellent multilevel cache algorithms are proposed to improve the I/O performance,they still have potential to be enhanced by investigating the history information of hints.This dissertation's objective is to improve the performance of graph query and transformation system which uses regular-expression-based queries and scalable distributed programming models.We study systematic approaches to build a general framework that evaluate the regular-expression-based queries more efficient.First,we propose a partition based approach to improve the performance of graph query and transformation by using structural recursion.The system uses select-where regular path query(SWRP)as an input.The SWRP queries are first translated into structural recursive functions on graphs.Then the functions are compiled into efficient programs in Pregel.We propose optimizations of structural recursion for large graphs to minimize the number of iterations during graph query processing in a distributed system.In order to improve query response time and throughput,subgraph based computation on structural recursive functions are used to replace typical vertex based evaluation approach.Meanwhile,we propose several redundancy rules to reduce graph size in parallel.We verify the performance of our system through evaluation on real datasets.The results show that,compared to the state-of-the-art approach,our algorithm achieves speedup over two times on query response time.Second,we introduce a functional-based approach to further improve the performance of queries for distributed graphs.The computation of reachability during the evaluation of queries takes more time than the other computations due to the dependency of vertices between subgraphs in a distributed graph.This demands further refinement of our framework.We start with a more fundamental query that is a regular reachability query.An RRQ test the existence of a path between two vertices in a directed graph,where the requirement for the path is described by a regular language over labels.We propose a novel algorithm that uses functions to present a set of state mappings in order to accelerate local evaluation and reduce communication cost,removing the bottlenecks in distributed RRQ.Extensive experiments conducted using real datasets show that our approach can significantly improve the overall performance,especially for large and sparse graphs,as compared to the state-of-the-art method.Third,we solve the caching problem in graph query and transformation system by introducing a Hint Frequency-based Approach(HFA).We observe that the access frequency of graph data is highly related to hint frequency which is a frequency of vertex/edge attributes traveling among cache levels.The main idea of HFA is using hint frequencies to efficiently explore the valuable history information of graph data among multiple levels.HFA can be applied with several popular multilevel cache algorithms,such as Demote,Promote and Hint-K.Last,we also study the energy consumption of our graph system.In terms of multi-level cache system,we propose a novel Power-Aware Multi-level cache(PAM)policy,which can reduce the energy consumption of storage devices in graph system with both high performance and high I/O bandwidth guaranteed.In our PAM policy,a proper number of cold dirty vertices in the upper level cache are identified and selected to flush directly to the storage devices,which provides high probability to extend the duration time of data disks with standby status.Thus the energy consumption can be reduced.
Keywords/Search Tags:Structural recursion, regular reachability query, Hint, Multi-level cache, Energy efficiency
PDF Full Text Request
Related items