Font Size: a A A

Research On Similarity Search Technique For Big Data

Posted on:2024-08-19Degree:MasterType:Thesis
Country:ChinaCandidate:Z QiuFull Text:PDF
GTID:2568306941964039Subject:Computer technology
Abstract/Summary:PDF Full Text Request
In the era of big data,there are numerous types of data with complex relationships.However,traditional linear scanning methods are unable to complete a big data search task.Similarity search is a basic subject in computer science,which has become a research hotspot in the field of big data in recent years.It can effectively solve the problems of slow search speed,large index space occupation and inaccurate search results caused by the rapid growth of data scale in the era of big data.However,as a powerful data model,graph can comprehensively describe the complex attribute information and rich semantic structure among data,which has been employed in many fields.This paper takes the graph similarity search algorithm based on the "filtering-verification" framework as the research object and the graph edit distance as the similarity measurement to carry out research from three aspects:filtering algorithm,graph edit distance algorithm and practical application.The main research contents and contributions are as follows:(1)Aiming at the problems of large candidate set,index space and long query response time,an improved graph similarity search algorithm with high efficiency and low index,called Z-Index,is proposed to optimize the filtering algorithm.The Z-Index algorithm partitions the query graph based on the extension probability to enhance the filtering effect.Then,a hierarchical filtering mechanism is proposed to further simplify the candidate set.Finally,the zero index is constructed by combining the encoding compression technology to reduce the size of the index space,so as to reduce the query response time further.The experimental results show that the Z-Index algorithm obtains a tighter lower bound,the size of candidate set generated by Z-index algorithm can be reduced by about 50%,the improvement interval of query efficiency is 9.1%~78.8%.Moreover,the algorithm has good scalability with very small index space.(2)In the verification stage of graph similarity search,aiming at the problems of low similarity computation efficiency of graph edit distance algorithm caused by large mapping space,low precision of equivalent mapping definition and high computational complexity of lower bound,a high-efficiency graph edit distance algorithm based on mapping encoding,named EC-GED,is proposed.It can identify equivalent mappings with higher precision by introducing mapping encoding.Then,it defines an efficient edit cost heuristic function with low complexity.Finally,it combines the mapping integrity boundary to prune the search tree,which can reduce mapping space and improve search performance.The experimental results show that the EC-GED algorithm can be used for large-scale datasets and has better search performance.The search space can be reduced by up to 49%,and the speed can be improved by 29%.(3)Based on the above two algorithms,the similarity computation scheme is further optimized,and a similarity search system for pharmaceutical big data is designed and implemented.The system consists of user display module and similarity search module,and it provides two search modes.Based on the MapReduce,the system can realize similarity search in parallel,making it possible to apply the similarity search algorithm to large-scale data sets.The system has passed the functional and performance tests,and the test results are in line with expectations.
Keywords/Search Tags:Similarity Search, Big Data, Filtering Algorithm, Graph Edit Distance Algorithm, MapReduce
PDF Full Text Request
Related items