Font Size: a A A

Research On Top-k Graph Similarity Search Algorithm Based On Chi-Squared Statistics In Probabilistic Graphs

Posted on:2023-03-29Degree:MasterType:Thesis
Country:ChinaCandidate:K YangFull Text:PDF
GTID:2530307076485344Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The subgraph matching problem aims to find all subgraphs in the data graph that have the same structure as the query graph.This problem is widely used in compound discovery,visual pattern recognition,communication network,community discovery and other fields.In practical applications,the connection between objects has a certain degree of unreliability.For example,in a communication network,the reliability of the connection between nodes is usually expressed by probability,which needs to be modeled by a probability graph.On probabilistic graphs,using subgraph matching algorithms on deterministic graphs returns a large number of poor-quality results because the presence or absence of edges is related to their probabilities on the edges.Therefore,a series of subgraph matching methods for probability graphs have been proposed.The state-of-the-art method for querying similar subgraphs on probability graphs is the Chi Se L algorithm proposed by Sourav Dutta et al.,which returns the top k most similar subgraphs in the data graph to the query graph.The Chi Se L algorithm uses the chi-square statistics in statistical analysis to efficiently find the Top-k results.The effectiveness of the chi-square statistics solution depends on the validity of the sample observations and expected values.The Chi Se L algorithm assumes that the labels have uniform distribution characteristics and calculates the chi-square value based on it.Due to the fact that the tags do not meet the requirements of uniform distribution,the quality of the query results is not high.In order to solve the problems above,this paper proposes a new Chi SSA(Chi-squared Statistics Based Top-k Similar Subgraph Search Algorithm)algorithm based on Chi-square statistics.According to the actual distribution of labels in the data graph,the algorithm proposes two new calculation methods of expected vectors: further,two optimization methods are proposed,namely the greedy strategy of expanding vertices based on chi-square value and the neighbor label filtering method.The former does not repeatedly consider the edge probability when expanding the vertices,thereby improving the quality of the result;the latter filters the unqualified vertex pairs by examining the neighbor label relationship and reducing the number of vertex pairs that need to calculate the chi-square value and improving the algorithm efficiency.The specific contributions of this paper are as follows:1)A chi-square statistics based Top-k similar subgraph search algorithm Chi SSA is proposed on the probability graphs.Based on the real distribution of labels in the data graph,the algorithm proposes two new calculation methods for the expected vector,which effectively improves the accuracy and quality of the Top-k results.2)Propose a greedy strategy for expanding vertices based on the chi-square value and an optimization method named neighbor label filtering.The former optimizes the way of selecting extended vertices during the matching process and improves the quality of the results;the latter examines the relationship between the vertex neighbor labels in the query graph and their candidate vertex neighbor labels,and filters out vertex pairs that cannot improve the accuracy of the results.,which reduce the number of vertex pairs with square values improves the running efficiency of the algorithm.3)We do experiments based on the real datasets.The experimental results show that the method proposed in this paper improves the accuracy,quality and efficiency of the algorithm.
Keywords/Search Tags:probability graphs, subgraph matching, Chi-Squared statistics, Top-k query
PDF Full Text Request
Related items