Research On Top-k Graph Similarity Search Algorithm Based On Chi-Squared Statistics In Probabilistic Graphs

Posted on:2023-03-29

Degree:Master

Type:Thesis

Country:China

Candidate:K Yang

Full Text:PDF

GTID:2530307076485344

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

The subgraph matching problem aims to find all subgraphs in the data graph that have the same structure as the query graph.This problem is widely used in compound discovery,visual pattern recognition,communication network,community discovery and other fields.In practical applications,the connection between objects has a certain degree of unreliability.For example,in a communication network,the reliability of the connection between nodes is usually expressed by probability,which needs to be modeled by a probability graph.On probabilistic graphs,using subgraph matching algorithms on deterministic graphs returns a large number of poor-quality results because the presence or absence of edges is related to their probabilities on the edges.Therefore,a series of subgraph matching methods for probability graphs have been proposed.The state-of-the-art method for querying similar subgraphs on probability graphs is the Chi Se L algorithm proposed by Sourav Dutta et al.,which returns the top k most similar subgraphs in the data graph to the query graph.The Chi Se L algorithm uses the chi-square statistics in statistical analysis to efficiently find the Top-k results.The effectiveness of the chi-square statistics solution depends on the validity of the sample observations and expected values.The Chi Se L algorithm assumes that the labels have uniform distribution characteristics and calculates the chi-square value based on it.Due to the fact that the tags do not meet the requirements of uniform distribution,the quality of the query results is not high.In order to solve the problems above,this paper proposes a new Chi SSA(Chi-squared Statistics Based Top-k Similar Subgraph Search Algorithm)algorithm based on Chi-square statistics.According to the actual distribution of labels in the data graph,the algorithm proposes two new calculation methods of expected vectors: further,two optimization methods are proposed,namely the greedy strategy of expanding vertices based on chi-square value and the neighbor label filtering method.The former does not repeatedly consider the edge probability when expanding the vertices,thereby improving the quality of the result;the latter filters the unqualified vertex pairs by examining the neighbor label relationship and reducing the number of vertex pairs that need to calculate the chi-square value and improving the algorithm efficiency.The specific contributions of this paper are as follows:1)A chi-square statistics based Top-k similar subgraph search algorithm Chi SSA is proposed on the probability graphs.Based on the real distribution of labels in the data graph,the algorithm proposes two new calculation methods for the expected vector,which effectively improves the accuracy and quality of the Top-k results.2)Propose a greedy strategy for expanding vertices based on the chi-square value and an optimization method named neighbor label filtering.The former optimizes the way of selecting extended vertices during the matching process and improves the quality of the results;the latter examines the relationship between the vertex neighbor labels in the query graph and their candidate vertex neighbor labels,and filters out vertex pairs that cannot improve the accuracy of the results.,which reduce the number of vertex pairs with square values improves the running efficiency of the algorithm.3)We do experiments based on the real datasets.The experimental results show that the method proposed in this paper improves the accuracy,quality and efficiency of the algorithm.

Keywords/Search Tags:

probability graphs, subgraph matching, Chi-Squared statistics, Top-k query

PDF Full Text Request

Related items

1	Research On RDF Subgraph Matching Method In Distributed Environment
2	Research On Method Of Subgraph Query And Match Based On K Tolerance
3	Research On Subgraph Query Algorithm Based On Double Index
4	Research On Multi-subgraph Matching Technology In A Distributed Environment
5	The Research On Probability Network Motifs Detection Algorithm Based On Protein-protein Interactions
6	Subgraph Similarity All-matching Techniques Over Uncertain Graphs
7	Research On Efficient Subgraph Matching Algorithm
8	Research On Multiple Query Methods On Large Scale RDF Graph
9	Studies On Query Processing Techniques Over Uncertain Graph Data
10	Research On The Parallel Query And Analysis Technology Of K-edge-connected Subgraph Over Large Graphs