| Image understanding is a cornerstone of image data analysis and an important part of making computer vision intelligent. Designing accurate and efficient image understanding algorithms is of great significance for both image-related theoretical research and industrial applications. However, due to the fuzziness of image data, image understanding faces uncertainty problems in knowledge representation, semantic annotation, and object recognition. Multi-view learning cooperatively processes information from multiple sources so that different forms of information complement one another, which gives it potential for solving the uncertainty problems in image understanding. In this thesis, we employ multi-view learning techniques to study image feature representation, image clustering, image classification, image co-segmentation, and image co-localization. This work supplements existing models and improves performance for research on uncertainty problems in image understanding. Specifically, the major contributions are as follows: (1) Multi-view image feature representation and clustering via nuclear norm minimization. Performing feature learning and clustering assignment separately is a widely used approach to multi-view image clustering. However, this two-step pipeline has the potential problem that the learned features may not reflect the clustering structure well. To address this, based on a matrix factorization model, this thesis introduces orthogonality and nonnegativity constraints to formulate a joint framework for feature learning and clustering assignment. Meanwhile, nuclear norm minimization compresses the principal components, which improves performance and enhances robustness. Extensive experimental results on real-world image and video datasets demonstrate that the proposed method obtains better clustering results than existing separate learning algorithms and outperforms other state-of-the-art
non-separate models. (2) Joint learning based multi-view image feature representation and clustering. Most current multi-view spectral clustering methods employ a Gaussian kernel to initialize the similarity matrix of each view. However, in the subsequent optimization they ignore both the local structure embedded in these matrices and the influence of noise in the datasets. Therefore, this thesis employs a local embedding technique to explore the local structure of the similarity matrices. Meanwhile, a nuclear norm constraint is imposed to capture the principal information of multiple views and improve robustness to noise and outliers. With these two techniques, a joint optimization framework for the similarity matrix and the low-dimensional embedding is formulated by integrating them with a loss function. The optimal solution, which consists of the similarity matrix and the low-dimensional representation, is finally applied to multi-view clustering. Extensive experimental results on real-world image and video datasets demonstrate that the two proposed clustering strategies each have their own emphasis, and both achieve superior performance over other state-of-the-art methods. (3) Embedding regularizer learning for multi-view image feature representation and semi-supervised classification. Due to the lack of large-scale labeled data, most image classification methods remain at the level of theoretical research. In addition, existing research focuses on single-view data and lacks comprehensive analysis of multi-view data. This thesis starts with a widely used linear regression model, derives a variant of it, and then employs the variant in an embedding regularizer learning framework to conduct multi-view semi-supervised classification. The framework integrates diversity, sparsity, and consensus to handle multi-view data with limited labels flexibly. Extensive experimental results on real-world image and video datasets demonstrate that the proposed method outperforms existing state-of-the-art multi-view semi-supervised
classification methods in terms of performance, robustness, and label-utilization capability. (4) Weakly supervised learning for multi-view image feature representation and object co-segmentation. Existing image co-segmentation methods can be roughly categorized into interactive and unsupervised techniques. The main problems in image co-segmentation are the marking workload of interactive methods and the segmentation accuracy of unsupervised ones. This thesis introduces image boundaries as weakly supervised background priors in unsupervised methods, which avoids the heavy marking workload of interactive algorithms while improving the segmentation performance of unsupervised ones. The two proposed multi-view fusion methods help to explore internal consistencies within a single image and correlations between multiple images, obtain more discriminative feature representations of foreground objects, and achieve accurate co-segmentation results. Extensive experimental results on two widely used image datasets demonstrate that the proposed method achieves co-segmentation performance superior to the state of the art, with significantly reduced time consumption. (5) Multi-view feature fusion for image co-localization. Existing co-localization methods are mainly based on low-level features and make little use of high-level features. To address this, this thesis takes a pre-trained CNN model as a feature extractor and formulates a multi-view feature fusion model that integrates the characteristics of different convolutional layers, so as to make effective use of both high-level and low-level features. The local embedding technique and the sparsity constraint in the model ensure that the fused features retain the important information of each convolutional layer while suppressing foreground-independent pixels. Extensive experimental results on three widely used image datasets demonstrate that employing the fused features on image
co-localization outperforms existing state-of-the-art methods. This further validates that combining a pre-trained CNN model with multi-view fusion can effectively realize the complementarity of high-level and low-level information, obtain more discriminative feature representations for images, and ultimately ensure the accuracy of localization results. Currently, the scale of available datasets limits the performance of deep learning in image clustering, semi-supervised image classification, image co-segmentation, and co-localization. Therefore, methods based on traditional machine learning are still necessary. The contributions of this thesis can provide theoretical guidance and practical evidence for the study of uncertainty in knowledge representation, semantic annotation, and object recognition in image understanding. They also provide new ideas for research on other image understanding tasks and new support for making computer vision intelligent. |
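As an illustration of the Gaussian-kernel initialization that contribution (2) attributes to existing multi-view spectral clustering methods, the following is a minimal sketch of building one similarity matrix per view. It is not the thesis's implementation: the function name `gaussian_similarity`, the bandwidth `sigma`, and the two toy views are assumptions introduced here for illustration only.

```python
import math


def gaussian_similarity(points, sigma=1.0):
    """Return the n x n Gaussian-kernel similarity matrix for one view.

    Entry (i, j) is exp(-||x_i - x_j||^2 / (2 * sigma^2)).
    `sigma` is a hypothetical bandwidth; the abstract does not specify one.
    """
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Squared Euclidean distance between samples i and j
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return w


# Two hypothetical views (feature sets) describing the same three samples
view_color = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]]
view_texture = [[1.0], [1.2], [5.0]]

# One similarity matrix per view, as in the per-view initialization step
similarities = [gaussian_similarity(v, sigma=1.0) for v in (view_color, view_texture)]
```

Each matrix is symmetric with ones on the diagonal, and nearby samples receive values close to one. The joint framework described in contribution (2) would then refine such matrices rather than keep them fixed.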