| Graphs can be used to model complex relationships between things,and are widely used in many areas.Recently,large-scale graph mining has become a popular research topic.Due to the increased volume of graph data and the iterative character of graph computation,the traditional Map Reduce framework requires multiple reads and writes of HDFS,making it difficult to support efficient graph computation tasks.In order to improve the efficiency of large-scale graph mining,academia and industry have designed many parallel graph computing platforms.Currently,large-scale graph mining is facing the following challenges:(1)The graph is relatively large,need single-side scanning,and high real-time.Graph partitioning is the first step for parallel graph mining.In order to improve the efficiency of graph partitioning,many parallel graph mining systems have adopted stream graph partitioning strategy.However,real-time partitioning only reads graph information once,so it is difficult to guarantee the quality of graph partitioning.(2)The vertices usually follow power-law distribution in real-world graphs.The high-degree vertices,which may have millions of neighbors,are difficult to partition in power-law graphs.A single partitioning strategy cannot optimize both high-degree and low-degree vertices at the same time.And it is difficult to balance load balancing and communication overhead.(3)The computational complexity of many graph mining algorithms is extremely high.Many graph mining algorithms need to find specific substructures from the graph.The computational complexity of such algorithms increases exponentially with the increase of the graph size.The traditional ”think like a vertex” model is difficult to support efficient computing for complex graph mining algorithms.To solve the above-mentioned challenges in large-scale graph mining,this thesis has studied the efficient graph mining platform from two dimensions: data parallelism and task parallelism.Specifically,we study three key issues of the graph mining system:the model for stream graph partitioning,hybrid graph partitioning for power-law graphs and parallel mining framework for frequent subgraphs.The main contributions of this thesis are summarized as follows:(1)Stream graph partitioning algorithm based on combinatorial design.Inthis thesis,the stream edge partitioning problem is modeled as a classic combinatorial optimization problem with constraints.Based on the combinatorial optimization,we aim to find the optimal solution that satisfies both load balancing and low communication cost.We propose the balanced edge partition design model from a theoretical perspective.The model not only ensures that the generated graph partitions have a strict upper bound on the replication factor,but also has extremely high partitioning throughput.Compared with the HDRF algorithm,the algorithm we proposed can reduce the time of streaming graph partitioning by about three times,and improve the partition quality by more than double.(2)Hybrid partitioning algorithm for power-law graphs.This thesis proposes a hybrid partitioning algorithm for power-law graphs.We use the balanced edge partition design model to address the challenge of high-degree vertices partitioning in power-law graphs,and use the label propagation strategy to partition low-degree vertices into partitions where the nearest neighobrs are.The hybrid algorithm can preserve the local topology of the power-law graph well.Experiments on public datasets show that our proposed graph partitioning algorithm can greatly improve the quality of graph computing up to two times.(3)A multi-core based parallel frequent subgraph mining framework.This thesis proposes Prefix FPM,a general parallel mining framework for frequent subgraph mining.In Prefix FPM,we propose the prefix-projection structure,which declines the complexity of frequent subgraph mining.At the same time,Prefix FPM adopt task-based parallel mining strategy to fully utilize the CPU resource of the machine.Besides,Prefix FPM supports various types of frequent subgraph mining tasks(path,tree,closed pattern,etc.).Experiments on public datasets show that Prefix FPM improves the efficiency of frequent subgraph mining by two orders of magnitude.(4)A distributed three-phase frequent subgraph mining framework.To address the communication bottleneck of frequent subgraph mining in distributed scenarios,this thesis implements a distributed frequent subgraph mining framework,Par Prefix FPM.Par Prefix FPM is designed based on the prefix-projection mining framework and adopts a three-stage frequent subgraph mining process,and implements an asynchronous message communication mechanism based onthe producer-consumer model.Experiments show that Par Prefix FPM has lower communication overhead in distributed frequent subgraph mining,and can provide a good speedup ratio for frequent subgraph mining algorithms. |