Image fusion is an important branch of information fusion. Its aim is to extract the effective features of multi-source images captured in the same scene and integrate them into a single fused image with comprehensive information. The fused image usually shows higher visual quality, facilitating subsequent human observation and machine understanding. Currently, image fusion technology has been extensively applied in various fields, including the military, healthcare, daily monitoring, and abnormality detection. Recently, deep learning-based image fusion methods have proliferated and made great progress, owing to the superior ability of neural networks to learn complex relationships among data and to extract intrinsically semantic features of high abstraction. Based on deep learning technology and taking multi-modal image fusion as the main object of study, this paper discusses the disadvantages of existing deep learning-based image fusion methods and puts forward a series of solutions built around latent representations and commonly adopted deep generative models to realize multi-modal image fusion. The specific research points are as follows:

(1) Public infrared and visible image datasets have no physical ground-truth fused images, which poses a great challenge for data-driven deep learning methods. Meanwhile, existing methods conduct the fusion process on low-level spatial features and are unable to overcome the modality gap. Through analyzing the imaging characteristics of the two modalities, this paper proposes an imaging simulation model for infrared and visible images using an RGB-D dataset and the atmospheric scattering model (a common form is sketched below). Furthermore, a conditional generative adversarial network (GAN)-based latent regression network is proposed for infrared and visible image fusion under a supervised setting (LatRAIVF). The experimental results demonstrate that the proposed method can effectively transfer the salient information from the source images and generate fused images with good visual quality, and they also confirm the rationality and effectiveness of the proposed imaging simulation model.
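For reference, the standard atmospheric scattering model commonly used for this kind of depth-based imaging simulation takes the following form; the exact formulation adopted in this work may differ:

\[
I(x) = J(x)\,t(x) + A\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta d(x)},
\]

where \(J(x)\) is the clear scene radiance, \(A\) is the global atmospheric light, \(d(x)\) is the scene depth provided by the RGB-D data, \(\beta\) is the scattering coefficient, and \(I(x)\) is the simulated observation.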
(2) Based on the synthetic infrared and visible image dataset, the network architecture for image fusion is studied further. Existing methods fail to consider the modality correlation and learn the infrared and visible features separately; moreover, a single, fixed fusion rule easily leads to the loss of discriminative information and limits the generalization ability of these methods. This paper proposes an image fusion framework based on hierarchical feature correlation and an attention mechanism (HAFGAN). An adaptive feature fusion block is designed to build cross-modal interaction and realize effective joint feature learning. The quantitative and qualitative comparison results demonstrate the effectiveness of the joint learning of hierarchical features and clearly show that the fused results are more consistent with human perception.

(3) Self-supervised learning uses the training data themselves as labels, exploiting the inherent distribution characteristics and correlations of images. Image fusion methods based on self-supervised generative learning can only learn pixel-level abstraction, so the extracted features are not robust. Meanwhile, existing self-supervised adversarial learning methods, namely the GAN-based methods, are unstable, and their loss functions cannot represent the salient information effectively. This paper proposes a novel and simple self-supervised adversarial learning strategy for infrared and visible image fusion (XFcGAN), in which a conditional discriminator is designed to distinguish positive from negative samples of infrared-visible pairs. In addition, an adaptive local-contrast-based structural similarity loss function is designed to guide the generation of the fused image. The experimental results demonstrate that the proposed method has a superior ability to transfer texture details and preserve the integral structure of the source images.

(4) An end-to-end neural network is a typical black box, so deep learning-based methods usually lack interpretability. Existing auto-encoder-based methods neglect, or cannot make full use of, the modality characteristics and the prior knowledge of the complementary-redundant relationship. A multi-modal image fusion framework based on disentangled latent representation is proposed, with a complementary group LASSO penalty term and a redundant consistency term to improve the disentanglement performance (a general sketch of such regularizers is given below). Visualization of the extracted features demonstrates that the proposed framework is able to learn more interpretable feature representations, facilitating the design of targeted fusion rules to realize the accurate integration of information in the fused images.
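As an illustration only (the precise terms used in this work are not reproduced here), a complementary group LASSO penalty combined with a redundant consistency term could take the following general form:

\[
\mathcal{L}_{\mathrm{reg}} \;=\; \lambda_{1} \sum_{g \in \mathcal{G}} \bigl\lVert z^{c}_{g} \bigr\rVert_{2} \;+\; \lambda_{2} \bigl\lVert z^{r}_{\mathrm{ir}} - z^{r}_{\mathrm{vis}} \bigr\rVert_{2}^{2},
\]

where \(z^{c}_{g}\) is the \(g\)-th group of complementary (modality-specific) features, \(z^{r}_{\mathrm{ir}}\) and \(z^{r}_{\mathrm{vis}}\) are the redundant (shared) features extracted from the infrared and visible inputs, and \(\lambda_{1}, \lambda_{2}\) are balancing weights. The group-wise \(\ell_{2}\) norm encourages structured sparsity across complementary feature groups, while the consistency term encourages the redundant codes of the two modalities to agree.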