
Research on Image Saliency Detection Models: Theories and Approaches for Different Modalities

Posted on: 2023-09-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y H Liang
Full Text: PDF
GTID: 1528306851473004
Subject: Computer application technology
Abstract/Summary:
Visual saliency is an important research topic in visual perception and scene understanding, involving disciplines such as cognitive neuroscience, cognitive psychology, and computer vision. The salient regions of a scene usually contain important objects of human interest and can attract visual attention within a short period of time. The purpose of salient object detection is to find these objects or regions of interest in a given natural image. In recent years, salient object detection has become a hot research topic and has attracted increasing attention from researchers. As a fundamental and significant task, salient object detection is introduced into image processing to automatically locate, predict, and mine important visual information that conforms to human cognition, while filtering out unimportant background information. It can improve the efficiency of information processing and reduce the computational burden of the model. At the same time, saliency detection can provide effective prior guidance, which can be applied to weakly supervised semantic segmentation, object tracking, image editing, and other tasks to assist their implementation. In addition, with the rapid development of software and hardware technology, salient object detection plays an increasingly important role in cutting-edge fields such as autonomous driving, industrial robotics, and human-computer interaction. The research on saliency detection algorithms therefore has broad application prospects and far-reaching scientific significance.

According to the modality of the processed images, salient object detection comprises several sub-branches: single-modality saliency detection with an RGB image as input; multi-modality RGB-D saliency detection with an RGB image and a depth map as input; multi-modality RGB-T saliency detection with an RGB image and a thermal infrared image as input; and multi-modality light field saliency detection with an all-focus image and focal slices as input. Research based on RGB images has made great progress, but some problems still need to be solved, such as incomplete segmentation of salient objects and rough, blurred edges of extracted objects. Compared with single-modality RGB saliency detection, research on multi-modality saliency detection needs further development; in particular, saliency detection based on RGB-T and light field data is still at an early stage. Based on these observations, this thesis relies on effective deep learning theory to carry out research on saliency detection for the above four input modalities and is devoted to proposing accurate and robust detection algorithms. The main work and contributions of this thesis are as follows:

1. SDCLNet: Semantic and detail collaborative learning network for RGB saliency detection. To obtain more accurate saliency prediction maps, current methods mainly focus on aggregating multi-level features with structures like U-Net and on introducing edge information as auxiliary supervision. Different from the focus of existing methods, we study the distinct roles of semantics and details in saliency detection. The task is decomposed into two parallel sub-tasks, internal semantic estimation and boundary detail prediction, and these sub-goals are optimized simultaneously via explicit constraints. Specifically, a backbone network with an additional layer is first adopted as a shared encoder to extract multi-scale features from each RGB image. Then, two asymmetric decoders are designed: the semantic decoder generates a coarse semantic mask, and the detail decoder generates a fine-grained object boundary. Finally, a collaborative learning block adaptively selects discriminative features for saliency prediction. In this way, semantic features and detail information are fused effectively to generate accurate and consistent saliency maps. Extensive experiments on six benchmark datasets demonstrate the effectiveness and superiority of the proposed model in terms of both subjective visual perception and objective evaluation metrics.

2. CMPNet: Cross-modal multi-enhanced pyramid network for RGB-D saliency detection. A depth map contains geometric cues that provide valuable supplementary information for saliency detection. Most existing RGB-D saliency detection methods adopt early-, late-, or middle-fusion schemes to explore the correlation between the RGB image and the depth map. However, these fusion strategies fail to adequately capture cross-modal and multi-scale fusion features. To this end, we propose a novel multi-modal enhanced pyramid network based on a multi-stream structure for RGB-D saliency detection. Specifically, the RGB image, the depth map, and their combination are first fed into a three-stream backbone, which explicitly captures the individuality and commonality of the two modalities. Then, the designed cross-modal feature multi-enhancement block encourages comprehensive interaction of cross-modal features from the three sources at each level, forming a multi-modal pyramid feature. Furthermore, to focus attention on high-level semantic features and low-level spatial structural features, a multi-scale feature attention block is proposed to handle the different levels. Finally, the features of different levels are integrated by a cross-level fusion attention block, and the predicted saliency map is generated. Experimental results show that the proposed algorithm outperforms contemporaneous algorithms on five challenging benchmark datasets.

3. MIA-DPD: Multi-modal interactive attention and dual progressive decoding network for RGB-D and RGB-T saliency detection. RGB-based saliency detection algorithms are unsatisfactory in challenging scenes, such as those with ambiguous object contours or low contrast between foreground and background. To alleviate this problem, saliency detection based on RGB-D or RGB-T inputs has been proposed; currently, however, the two are usually treated as separate visual tasks. Moreover, most of these methods directly extract and fuse raw features from the backbone. We explore the potential commonalities between the two tasks and propose an end-to-end unified framework that can be used for both RGB-D and RGB-T saliency detection. Specifically, multi-modal interactive attention units effectively capture rich multi-layer context features from each modality, serving as a bridge between feature encoding and cross-modal decoding. An attention-guided cross-modal decoding module and a multi-level feature progressive decoding module then gradually integrate complementary features from multi-source features and from different levels of fused features, respectively. Experimental results on RGB-D and RGB-T datasets show that the proposed algorithm performs well in terms of detection accuracy and model generalization compared with previous algorithms.

4. DGENet: Dual guidance enhanced network for light field saliency detection. Saliency detection models using light field data as input have not been thoroughly explored. Existing deep saliency models usually treat multi-focus images as independent information and extract their features separately, which can be cumbersome and over-reliant on carefully designed network structures. In addition, they do not fully exploit the cross-modal complementarity and cross-level continuity of information, and they rarely consider edge cues. Based on these observations, we propose a dual guidance enhanced network that considers both spatial content and explicit boundary cues. Specifically, the proposed model consists of two key modules: a recurrent global-guided focus module and a boundary-guided semantic accumulation module. The former distills effective squeezed information from focal slices and RGB images across different levels; the global context features guide the network to focus on the salient region through a progressive reverse-attention-driven strategy. The latter introduces salient edge features to guide the accumulation of salient object features, generating saliency maps with sharp boundaries. Experimental results on three benchmark light field datasets show that the proposed algorithm outperforms state-of-the-art 2D, 3D, and 4D methods and more effectively guarantees the integrity and sharpness of object contours.

To sum up, this thesis proposes a series of data-driven models for processing images of different modalities and verifies their effectiveness by combining theoretical analysis with experiments. These algorithms enrich research in the field of visual saliency detection and promote the development of image saliency detection across modalities. In addition, this thesis presents the problems and challenges faced by current saliency detection algorithms for different modalities and looks forward to future research trends in this field.
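The semantic/detail decomposition described for SDCLNet can be sketched in miniature. The snippet below is a hypothetical, non-learned simplification, not the thesis's actual model: the real network uses trained CNN decoders and an adaptive collaborative learning block, whereas here the "semantic decoder" is a plain intensity threshold, the "detail decoder" is a finite-difference edge map, and fusion is a fixed weighted sum. It only illustrates the idea of estimating interior semantics and boundary details as parallel sub-tasks and then combining them.

```python
# Toy illustration of the semantic/detail decomposition behind SDCLNet.
# All function names and the fusion rule are invented simplifications.

def semantic_decoder(img, thresh=0.5):
    """Coarse interior mask: 1.0 where intensity exceeds a threshold."""
    return [[1.0 if v > thresh else 0.0 for v in row] for row in img]

def detail_decoder(img):
    """Boundary map: clipped gradient magnitude via forward differences."""
    h, w = len(img), len(img[0])
    edges = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][x]
            gy = img[min(y + 1, h - 1)][x] - img[y][x]
            edges[y][x] = min(1.0, (gx * gx + gy * gy) ** 0.5)
    return edges

def collaborative_fusion(mask, edges, alpha=0.7):
    """Fuse interior and boundary cues into one saliency map."""
    return [[alpha * m + (1 - alpha) * e for m, e in zip(mr, er)]
            for mr, er in zip(mask, edges)]

if __name__ == "__main__":
    # 4x4 synthetic "image": a bright 2x2 object on a dark background.
    img = [[0.1, 0.1, 0.1, 0.1],
           [0.1, 0.9, 0.9, 0.1],
           [0.1, 0.9, 0.9, 0.1],
           [0.1, 0.1, 0.1, 0.1]]
    sal = collaborative_fusion(semantic_decoder(img), detail_decoder(img))
    # Object interior pixels score higher than background pixels.
    print(sal[1][1] > sal[0][0])  # True
```

In the actual networks, both decoders share one encoder's multi-scale features and the fusion weights are learned, but the division of labor is the same: one branch localizes the object body, the other sharpens its contour.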
Keywords/Search Tags:Deep learning, Salient object detection, Multi-modal, Feature extraction and fusion, End-to-end learning, Encoder-decoder structure