| With the advent of the era of big data,the demand for data analysis in various industries and fields is expanding,and the required data mining technologies are developing rapidly.The retrieval technology for large-scale high-dimensional datasets is an essential and important foundation.In database,recommendation system,machine learning and other computer professional fields,nearest neighbor search supports many important applications.Accurate nearest neighbor search technologies,which perform well in low dimensional spa-ce,has a serious decline in efficiency in high dimensional space.Therefore,researchers turn to the study of approximate nearest nei ghbor search.Among them,graph-based algorithms perform best in query efficiency.Especially for the query with high recall rate,they are several times or even tens of times more efficient than other algorithms.In view of the existing problems of graph-based algorithms,this paper puts forward various improved designs,and we list them as follows.Monotonic search network(MSNET)is the most efficient in current graph-based algorithms.It is dedicated to realize monoton ic search by minimizing the average out-degree of nodes,thus reducing the length of search path to maximize the query efficiency.However,according to our inference,the current design of monotonic search network can not guarantee the realization of mono-tonic search.According to our research,the complexity of high-dimensional space and the unpredictability of query vector are two main factors leading to the failure of monotonic search.Therefore,we propose an improved design of monotonic search network,called(r,p)-MSNET.When the distance between the query vector and its nearest neighbor is less than r,the(r,p)-MSNET guarantees that the probability of monotonic search is not less than p.Due to the high time complexity of establishing a(r,p)-MSNET,we propose an approximate implementation,called Tree-based Search Graph(TBSG).Experiments conducted on several commonly used high-dimensional vector datasets show that TBSG is significantly more efficient in query than other algorithms and has a faster index building speed.In addition,further experiments show that transplanting the neighbor selection strategy of TBSG to other graph-based algorithms can effectively improve their query efficiency.In order to solve the problems of large amount of memory cost by index and slow index construction in current graph-based algorithms,we propose a distributed index design.The distributed index design refers to dividing the original large-scale dataset into multiple sub datasets and assigning them to different hosts and constructing sub indexes respectively.When performing the query,the algorithm sorts the sub indexes according to its distance to the query vector,implements the query only in several of them according to the target recall rate,and finally returns the aggregated result.The distributed index design can significantly reduce the memory pressure of single node and improve the scalability of graph-based algorithm on very large scale datasets. |