Font Size: a A A

Research On Key Technologies For Fast Approximate Graph Pattern Mining

Posted on:2023-01-14Degree:MasterType:Thesis
Country:ChinaCandidate:Y J WangFull Text:PDF
GTID:2568307025465964Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Along with the exponential increment of Internet data recently,the development of big data technology has been promoted drastically.Most data is generated from the connection and interaction between separate entities.Which is very suitable for modeling with graph structure.We can store entity information as vertex,and store interaction information between entities as edge.Graph mining explores the relationship between data by analyzing the topology of the graph.Graph pattern mining is the key to graph mining,which is used to find specific subgraph structures.The scale of graph data is becoming larger and larger,meanwhile many applications are making higher requirements for realtime response.Under these circumstances,the time of traditional graph pattern mining becomes unacceptable.In fact,accurate results are not necessary for many applications that only require them to be of the same magnitude order.Nowadays,approximate graph pattern mining is gaining more and more attention because it reduces the size of the graph dataset by sampling and estimates the result of the whole dataset based on probability.However,the current approximate methods are still expensive and difficult to scale.Researchers have found most real-world graphs follow the power-law distribution,in which the relation function between vertex and degree is a power function.Through many experiments,it is found that the contribution to patterns of vertices with different degrees is significant in the power-law graph.Based on the above observation,this thesis proposes two fast approximate graph pattern mining algorithms and corresponding preprocessing techniques.· Power-law Driven Estimator Distribution: PDED.A lot of graph data shows that the edges of vertices with high degrees contribute more patterns than those with low degrees.Traditional approximate graph pattern mining algorithms do not consider the power-law distribution,and uniformly sample all edges,resulting in low sampling efficiency and inaccurate scaling factor.PDED allocates estimators based on power-law distribution and samples edge at the whole graph.PDED is a fast and accurate approximate pattern mining algorithm.It not only reduces the cost of scanning the edge stream but also improves the sampling probability of the high-degree edges.Furthermore,PDED can balance between error rate and execution time by controlling the number of estimators.· Pattern Occurrence Distribution-based Approximation: PODA.PODA has devised from the observation that pattern number cumulative distribution with degree sequence follows the power-law distribution in power-law graphs.Firstly,PODA only scans edges with a small number of degrees and calculates pattern numbers as samples.Then,PODA fits the distribution model of the pattern and estimates entries patterns number according to the model.Because there are errors in model fitting,PODA is also an approximate mining algorithm,but it speeds up processing and improves the scalability by reducing the number of scanned edges compared with traditional sampling algorithms.At the same time,the sampling in PODA is a subproblem for calculating the number of patterns.To further provide the flexibility of error rate and execution time,PODA supports accurate(ACC)and approximate(APP)counting modes,which can apply any current algorithm.· Preprocessing Based on degree distribution.The power-law feature is related to the degree sequence.Before the pattern mining task,processing the degree sequence of the graph can obtain degree information and generate auxiliary data structure.This is helpful for applying power-law distribution better.Graph simplification removes self-loops and parallel edges.Degree reordering ensures that the edges of vertices of high degrees can be sampled with higher probability.The degree dictionary supports the partition of subgraphs by degrees.Compressing sparse row indexes can quickly separate neighborhood vertices and accelerate the mining algorithm.The prototype is designed and implemented with Spark Graph X and is evaluated against various well-known graphs.Compared with accurate pattern mining algorithms,PDED speeds up execution time by 1-3 orders of magnitude,ACCand APPof PODA speed up 1.5x-2x and 4x-50 x,respectively.Compared with approximate pattern mining algorithms,PDED is about 100 x faster,while the average speedup of ACCand APPof PODA is up to 5x and 20 x,respectively.All error rates in PODA are less than 10%,which meets the general requirements of approximate pattern mining applications.
Keywords/Search Tags:Graph Mining, Pattern Mining, Approximate Calculate, Power Law Graph
PDF Full Text Request
Related items