| Given a query object specified in either the textual or the visual modality, object retrieval and localization aim to find it in a large-scale multimedia dataset. Object retrieval and localization have been central problems in high-level computer vision, with a wide variety of applications. However, because of illumination changes, occlusion, background clutter, camera viewpoint changes, object deformation, and other internal or external factors, the same target object may look quite different from image to image. Moreover, the correlations among instances, images, and labels are diverse and complicated. Object retrieval and localization are therefore very challenging. Starting from real-world applications, this thesis presents extensive research on three related problems: instance search, tag-based search, and object localization. We improve the bag-of-words model and deep convolutional networks in three aspects: representation, learning, and correlations. The contributions of this thesis are as follows: 1. Bayes pooling of visual phrases for instance search. First, we refine the definition of visual phrases to make them more discriminative. Second, we analyze the problem of visual phrase burstiness, which is important for instance search but tends to be ignored. We then propose a novel Bayes pooling strategy that addresses this problem from a probabilistic view. Extensive experiments demonstrate that the proposed method outperforms other visual phrase-based models. 2. A locality-aligned deep model for generic instance search. We propose a locality-aligned strategy to deal with the asymmetrical similarity metric involved in instance search. To obtain discriminative region representations, we use a deep convolutional network that captures both intra-class and inter-class distinctions among regions.
In addition, we propose a semi-supervised method for collecting appropriate training data for the network, which requires no prior knowledge of the query object and very little manual annotation. Extensive experiments confirm that our method is suitable for generic instance search. 3. Online multiple instance learning for image annotation and object localization under weak supervision. First, building on multiple instance learning, we improve the region collection strategy: the proposed method collects purified positive training regions and discriminative hard negative regions, both of which improve foreground-background distinction. Second, region selection is combined with object detector learning in a unified framework, which allows end-to-end training. Finally, by explicitly correlating instance categories with image tags, we address both image annotation and object localization within one model. Extensive evaluations on the PASCAL VOC 2007 and 2012 datasets demonstrate that the proposed method effectively improves image annotation and object localization. 4. An attention-based deep representation model to optimize region-based object representation. The attention-based representation model highlights the activations on discriminative object parts and down-weights the activations on irrelevant image background. As a result, object-background ambiguity can be largely reduced. The proposed object representation model can be seamlessly integrated into a state-of-the-art weakly supervised detection framework and trained for better image annotation and object localization. |
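
To make the idea behind the fourth contribution concrete, the following is a minimal sketch of attention-weighted pooling over region locations. It is only an illustration under stated assumptions: the abstract does not specify the form of the attention, so the softmax weighting, the function names `softmax` and `attention_pool`, and the toy inputs are all hypothetical and are not the thesis's actual model.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Pool N location features (each of length D) into one D-dim vector.

    Assumption: attention is a softmax over per-location scores, so
    high-scoring (object-like) locations dominate the pooled vector and
    low-scoring (background-like) locations are down-weighted.
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per feature vector")
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [0.0] * dim
    for weight, feature in zip(weights, features):
        for d in range(dim):
            pooled[d] += weight * feature[d]
    return pooled, weights


# Toy input: one "object part" location and two "background" locations.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
scores = [4.0, 0.0, 0.0]  # hypothetical attention scores
pooled, weights = attention_pool(features, scores)
```

With equal scores the function reduces to plain average pooling, so in this sketch the attention scores are the only thing that shifts the representation toward the object part.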