Font Size: a A A

Research On Signle-source Top-k Query Algorithm Based On SimRank

Posted on:2022-12-23Degree:MasterType:Thesis
Country:ChinaCandidate:M T MinFull Text:PDF
GTID:2480306779471624Subject:Computer Software and Application of Computer
Abstract/Summary:PDF Full Text Request
SimRank is used to evaluate node similarity based on graph topology.The single-source Top-k query problem based on SimRank is to find the k nodes with the highest similarity to a given query node,which is widely used in large-scale graph mining,collaborative filtering and other fields.When calculating the Top-k results,existing algorithms need to calculate the SimRank between all nodes and the query node,and then sort and filter out the Top-k result.Because the existing algorithms have many invalid calculations in the calculation process,the efficiency of solving the Top-k result is low.To this problem,this paper studies the single-source Top-k query problem.The specific research content is as follows.First,this paper proposes a Top-k query algorithm HitSimbased on the upper and lower bounds of SimRank.Unlike existing algorithms that directly calculate the SimRank of all nodes,HitSimimproves the efficiency of Top-k query by reducing the number of nodes to be calculated.The algorithm first classifies the nodes,obtains the corresponding upper bound of SimRank for different types of nodes,and then maintains the lower bound of the Top-k SimRank during calculation.By comparing the upper and lower bounds,the number of nodes to be calculated can be reduced.Secondly,this paper proposes an optimization algorithm HitSim-OPT for dense graphs.When HitSimclassifies the nodes of dense graphs,the nodes may be concentrated in one node set,resulting in poor pruning effect of HitSim.To solve this problem,HitSim-OPT directly calculates the SimRank upper bound of all nodes,maintains the SimRank lower bound of the Top-k result during the calculation process,and determines whether the corresponding node needs to be processed by comparing the upper and lower bounds.At the same time,for the nodes that need to be processed,the upper bound of SimRank can be used to reduce the number of random walks.Finally,based on 8 real datasets for testing,the experimental results show that the HitSimalgorithm and HitSim-OPT algorithm are nearly three orders of magnitude faster than the existing algorithms in terms of query efficiency,while at the same time ensures the accuracy of query results.
Keywords/Search Tags:data mining, node similarity, single-source SimRank, Top-k query, random walk
PDF Full Text Request
Related items