Content-based image retrieval is one of the fundamental research topics in computer vision. The objective of image retrieval is to identify the images relevant to a query image in a large-scale image database and return them to the user in descending order of relevance. Image retrieval has been widely used in product retrieval, landmark retrieval, person retrieval, and other fields. Visual re-ranking is an important post-processing step in content-based image retrieval: when the first-round retrieval result is unsatisfactory, visual re-ranking techniques can be used to refine the initial result, which improves the precision of image retrieval. Visual re-ranking techniques have made great progress and have become a core module in many image retrieval systems.

However, several pressing problems in visual re-ranking cannot be ignored. First, existing re-ranking techniques fail to handle top-ranked irrelevant images; since many re-ranking methods obtain additional information about the query from the top-ranked images, such irrelevant images have a negative influence on the precision gain of re-ranking. Second, owing to the high computational complexity of most existing methods, it is difficult to achieve high precision with low latency. Finally, many re-ranking methods have poor robustness and can only be applied to specific features or tasks, making it hard to handle the complicated situations in real image retrieval systems. These problems limit the application of visual re-ranking methods in real image retrieval systems.

To solve the above problems, this thesis proposes a new semantic relevance learning framework for visual re-ranking, which aims to learn the semantic relevance between images according to contextual similarity information. The proposed methods improve the precision of image retrieval while ensuring low computational overhead. The main contributions of this thesis are
summarized in the following three aspects.

First, this thesis proposes a visual re-ranking method based on Collaborative Relevance Learning (CRL). In the first-round retrieval results, the top-ranked images carry important contextual information, which can be used to learn the semantic relevance between images. This thesis therefore proposes a collaborative semantic relevance learning method, which provides a more accurate similarity measurement between the query and the top-ranked images from the first-round retrieval and thus improves retrieval precision. Specifically, we represent the image set of a fixed-length retrieval list as a correlation matrix and learn the relevance of all image pairs simultaneously with a lightweight CNN model. To find the optimal length of the retrieval list for different queries, we present a query-sensitive selection method. In sum, the CRL method improves the precision of image retrieval with minimal computational overhead.

Second, this thesis proposes a re-ranking method based on Contextual Similarity Aggregation (CSA). In image retrieval, the contextual similarity between the top-ranked candidate images of the first-round retrieval is an important clue for distinguishing semantic relevance. In this method, an affinity feature is defined to represent the contextual information among the candidate images. To further aggregate the contextual similarity information of the candidates, we design a network based on the Transformer encoder that learns the relevance between images and aggregates the affinity features of the candidate images. Since the proposed re-ranking model takes only the affinity features as input, the re-ranking network has good robustness. In sum, the CSA method improves the precision of image retrieval by a large margin with low latency.

Lastly, this thesis proposes a re-ranking method based on Adaptive Query Expansion Learning (QEL). In traditional query expansion methods, a monotonically decreasing weight function over the ranks of
images is generated for feature aggregation, but this scheme ignores the semantic relevance among images during aggregation. To solve this problem, this thesis formulates query expansion as a representation learning problem, training a network to generate the expanded feature of the query. We use a Transformer encoder to learn the relevance between the query and the top-ranked images, which then drives the feature aggregation. To constrain the expanded feature to share the same embedding space as the original visual features, we modify the structure of the Transformer encoders. Instead of producing monotonically decreasing aggregation weights, this method uses the network to directly generate the expanded feature of the query. In sum, the QEL method improves the precision of the retrieval system with limited computational overhead.

In this thesis, we propose three different methods that learn the semantic relevance between images from the similarity between them and use it to re-rank the first-round retrieval results. This thesis elaborates the above three works and demonstrates the effectiveness, efficiency, and robustness of the proposed methods with comprehensive experiments.
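The correlation-matrix input used by the CRL method can be illustrated with a minimal sketch: stack the query with its top-k retrieved features and compute all pairwise cosine similarities, yielding the matrix that a lightweight CNN would consume. The function name and the toy data below are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def correlation_matrix(query_feat, top_feats):
    """Build the (k+1, k+1) pairwise cosine-similarity matrix of the
    query plus its top-k retrieved images -- the kind of input a
    lightweight CNN could score for pairwise relevance."""
    feats = np.vstack([query_feat[None, :], top_feats])        # (k+1, d)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return feats @ feats.T                                     # (k+1, k+1)

# toy example: a query and 4 top-ranked images with 8-dim features
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
top = rng.standard_normal((4, 8))
M = correlation_matrix(q, top)
```

Row 0 of `M` holds the query's similarity to each candidate; the remaining rows capture candidate-to-candidate context, which is what lets relevance be learned for all pairs simultaneously.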
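The CSA idea of an affinity feature, and its aggregation by a Transformer encoder, can be sketched as follows. Here each candidate's affinity feature is its row of similarities to the other top-L candidates, and a single untrained self-attention step stands in for the encoder; the exact architecture in the thesis differs, so treat this as an assumption-laden illustration only.

```python
import numpy as np

def affinity_features(feats):
    """Each candidate's affinity feature is its vector of cosine
    similarities to all L candidates (one row of the affinity matrix)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T                       # (L, L)

def self_attention(x):
    """One single-head self-attention step (the core operation of a
    Transformer encoder) mixing contextual similarity across candidates."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)  # row-wise softmax weights
    return w @ x                          # aggregated affinity features

rng = np.random.default_rng(1)
A = affinity_features(rng.standard_normal((6, 16)))
out = self_attention(A)
```

Because the network sees only `A` and never the raw descriptors, the same model can in principle be reused across different feature extractors, which is the robustness argument made for CSA.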
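The rank-based baseline that QEL replaces can be made concrete with the classic alpha-query-expansion scheme, where top-k features are aggregated with monotonically decreasing weights sim**alpha. This baseline (not the thesis's learned method) is sketched below; the function name and toy data are ours.

```python
import numpy as np

def alpha_qe(query, top_feats, alpha=3.0):
    """Classic alpha-query-expansion: aggregate the query with its top-k
    neighbours using weights sim**alpha, a hand-crafted monotonically
    decreasing function that ignores inter-image semantic relevance.
    QEL instead trains a network to output the expanded feature directly."""
    sims = top_feats @ query / (
        np.linalg.norm(top_feats, axis=1) * np.linalg.norm(query))
    w = np.clip(sims, 0.0, None) ** alpha         # decreasing with rank
    expanded = query + (w[:, None] * top_feats).sum(axis=0)
    return expanded / np.linalg.norm(expanded)    # stay on the unit sphere

rng = np.random.default_rng(2)
q = rng.standard_normal(32)
top = rng.standard_normal((5, 32))
e = alpha_qe(q, top)
```

The final normalization mirrors the constraint mentioned above: the expanded feature must live in the same embedding space as the original visual features so it can be matched against the database directly.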