Font Size: a A A

Design And Implementation Of The Streaming Graph Engine On Imbalance Cluster

Posted on:2017-03-14Degree:MasterType:Thesis
Country:ChinaCandidate:Y CaoFull Text:PDF
GTID:2370330590991619Subject:Software engineering
Abstract/Summary:PDF Full Text Request
In recent years,the demand for large-scale graph data processing is increasing strongly.To deal with these data,some graph computation models has been proposed,such as SR Model[1]and GAS Model[2,3].There are two kinds of graph computing engines:offline processing and streaming processing.Streaming graph engines include PHISH[4,5]and Flink[6,7].But they are rarely used in industrial circles.Industrial circles often use offline engines,such as Pregel,GraphLab[8,9]and GraphX[10].So,how to implement a streaming graph engine based on GraphX is meaningful.The graph engines mentioned above,including GraphX,are all based on one assumption that the performance of computing nodes are the same,and the large-scale graph should be partitioned equally.However,in industrial environments,the computation abbilities of nodes in the cluster are often imbalance.In this condition,the slowest node will limit the computing time of the cluster.So,it's meaningful to research the partitioning approach for imbalance cluster.In view of the above problems,this paper takes the actual project as background and analyzes the factors that will affect the partitioning algorithm imbalance clusters.Then it puts forward a streaming partitioning algorithm for the imbalance clusters,and implements a streaming graph engine based on GraphX.Then,this paper compared the performance among different streaming partitioning algorithms.Tests are done in clusters with different size and different imbalance degree with different datasets.The results prove that our algorithm can control the time increasing when the imbalance degree increases.Compared with other similar systems,this work has the following characteristics:1)The existing flow graph partitioning algorithms includes Hash Partition,Balanced Partition,Chunking Partition,Deterministic Greedy Partition[11,12]and so on.But they do not consider the problem of cluster imbalance.To deal with the problem that computation performance limited by the slowest node,this paper proposed a partition algorithm for streaming graph on imbalance clusters named PASGIC.The algorithm dynamically monitors the CPU and memory usage condition of clusters.Based on these indicators,PASGIC uses greedy strategy to partition the graph.It aims to avoid the side effects of the slowest node.Experiments show that PASGIC has more than 5%higher processing speed than Deterministic Greedy and Random partitioning algorithm and more than 16%higher than random partitioning imbalance cluster.2)GraphX is widely used in industry,but it does not support streaming data processing.This paper designs and implements a streaming graph computation engine GraphA by extending the operation mode and graph partitioning method of GraphX.It also implements PASGIC algorithm in this engine.The engine has several features,including auto detecting the cluster imbalance,dynamic partitioning streaming graph by the imbalance degree and auto applying and releasing computing resources.Experiments show that GraphA can correctly handle the streaming graph,and the recall of predicting the need of resources operation reached over 75%.3)In the field of verification of the streaming graph engine,there is a lack of the measurement index of the imbalance of the cluster.In this paper,a new index system for measuring the imbalance of clusters is proposed.Based on this index system,we use 6different clusters and different datasets to verify GraphA.Experimental results show that GraphA can partition the graph according to the imbalance degree of cluster correctly.
Keywords/Search Tags:Distributed Graph Computing, Imbalance Cluster, Streaming Computing, Large-scale Data Processing
PDF Full Text Request
Related items