Font Size: a A A

Research And Implementation Of Subgraph Matching Algorithm Based On Spark

Posted on:2018-09-13Degree:MasterType:Thesis
Country:ChinaCandidate:T GuoFull Text:PDF
GTID:2310330512993281Subject:Computer technology
Abstract/Summary:PDF Full Text Request
The graph is a kind of data structure formed on the basis of vertex and edge,and has a very flexible expression ability to describe the relationships between things.With the arrival of big data age,the scale of data is increasing unprecedentedly.Facebook,Twitter,Weibo and other social media has produced a large-scale social data.How to deal with large-scale graph data has become a very critical research problem.The sub-graph matching problem is a very basic and important problem in the field of graph data processing.Also,the search and pattern mining algorithms on the graph must be based on a fast subgraph matching algorithm.The mathematic basis of the sub-graph matching is the classical problem subgraph isomorphism in the graph theory,a well-known NP complete problem.Some of the methods of pure algorithm are given in the previous research work.At present,although some scholars have given some ways to achieve sub-map matching,but these methods can only deal with small-scale graph data.In response to today's large-scale data,the matching efficiency and scalability are very inadequate.In addition,most sub-graph matching algorithms are applied to undirected graphs.And there are problems of inadequacy or inefficiency when dealing with directed graphs.In view of the above problems,this paper relies on the big data platform which has been developed in recent years,and uses it to provide powerful storage and computing power.For the first time,Spark,a big data processing platform,is used as the processing engine to implement a subgraph isomorphism algorithm which can deal with large graph data with GraphX component.The algorithm is based on HDFS as the storage platform,which effectively solves the problem of cluster load balancing.The calculation process is divided into distributed filtering stage and distributed verification stage.In distributed filtering stage,we proposed the vertex signature data structure with taking the characteristics of the Spark platform and the graph segmentation strategy of GraphX with vertices as the split into account.By matching vertex signature parallel,we realized the fast and efficient filtering of the data graph.Besides,the vertex signature expresses itself and neighborhood information,and the parents and children in neighborhood are distinguished,which improves the filtering effect on the directed graph.In the distributed verification phase,the concept of candidate subgraph matching region is proposed by using the Spark platform's distributed computing advantage.Through the calculation of the center vertex of the query graph,a number of candidate subgraph matching regions corresponding to the size of the query graph are obtained in the data graph.The filtered oversized graph data is further segmented to perform efficient subgraph matching operations in the smaller candidate subgraph matching region.Finally,the experiments show that the distributed sub-graph matching algorithm has a good matching efficiency and scalability.In the contrast experiment with the current excellent sub-graph matching algorithm VF2,this algorithm has obvious efficiency advantage.
Keywords/Search Tags:Spark, subgraph isomorphism, graph mining, parallel algorithm
PDF Full Text Request
Related items