The rapid development of sensor technology has provided humans with observation data of various modalities, such as hyperspectral images, multispectral images, infrared images, synthetic aperture radar, light detection and ranging, and panchromatic images. How to fully utilize these data for comprehensive, fast, and accurate analysis has long been a focus of researchers in the remote sensing field. With the acquisition of massive data, the differences between modalities can be turned into complementary advantages, effectively improving accuracy on tasks such as remote sensing land cover classification. Remote sensing multi-modal data fusion has been applied in many areas, including agricultural monitoring, urban planning, and national defense security, and is one of the current hotspots in remote sensing research.

However, several problems remain in current remote sensing multi-modal fusion: 1) at the data level, existing methods often ignore the local neighboring information around pixels when capturing the global spatial-spectral information of multiple modalities; 2) at the feature level, existing feature fusion methods are mostly limited to a single scale and cannot achieve fine-grained fusion of multi-level local and global information; 3) at the task level, existing methods treat feature extraction and feature fusion as relatively independent processes and do not fully exploit the complementary information between multi-modal data during feature extraction. To address these problems, this dissertation fully utilizes the advantages of graph models in representing global relationships and conducts research in the following aspects:

(1) To address the difficulty of simultaneously capturing spatial-spectral features and pixel neighboring information at the data level, a multi-modal data fusion method based on a spatial-spectral graph network is
proposed in this dissertation. The spatial-spectral graph network establishes associations between modalities at the data level and extracts features of multi-modal data from the perspectives of local spatial constraints and spectral-spatial proximity. The proposed network includes a local module and a global module: the local module uses a convolutional neural network to maintain the local spatial relationships between pixels, while the global module constructs a spectral-spatial multi-modal graph to preserve the spectral-spatial neighboring information in multi-modal data. Finally, the network generates a comprehensive multi-modal data representation.

(2) To address the problem that feature extraction within a limited level makes it difficult to balance global features against small-scale information, this dissertation proposes a multi-scale feature fusion method based on a graph encoder-decoder network, which uses a graph model to maintain global sample associations while fusing multi-scale features, extracting both local detail information and global information. The graph encoder maps multi-modal data at multiple scales into the graph space, where feature extraction is completed; the graph decoder maps the multi-scale features back to the original data space, where multi-scale feature fusion is completed.

(3) To address the lack of consideration of modality correlation during feature extraction, this dissertation proposes a graph fusion network for multi-modal data classification, which uses feature fusion to guide the feature extraction process. Specifically, the network takes multi-source data as input and directly outputs unified features containing fused multi-modal information. In this process, a multi-modal topology graph is constructed to extract the complementary information between modalities, and two graph-based loss functions, a Laplacian loss and a t-SNE loss, are used to constrain the feature extraction process.
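To make the graph-based constraint in (3) concrete, the sketch below shows the standard form of a Laplacian loss on a sample graph: tr(FᵀLF) with L = D − W, which equals ½ Σᵢⱼ Wᵢⱼ‖fᵢ − fⱼ‖² and therefore pushes features of connected samples together. This is a minimal illustration only; the k-nearest-neighbour graph construction (`knn_adjacency`) and all function names are assumptions, and the dissertation's actual multi-modal topology graph and loss weighting may differ.

```python
import numpy as np

def knn_adjacency(X, k=3):
    """Build a symmetric k-nearest-neighbour adjacency matrix.

    X: (n_samples, n_features) stacked multi-modal features.
    Hypothetical graph construction; the dissertation's exact
    multi-modal topology graph may be built differently."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d2[i])[1:k + 1]  # skip self (distance 0)
        W[i, idx] = 1.0
    return np.maximum(W, W.T)  # symmetrise

def laplacian_loss(F, W):
    """Graph-smoothness penalty tr(F^T L F) with L = D - W.

    Equals 0.5 * sum_ij W_ij * ||f_i - f_j||^2, so samples that
    are connected in the graph are constrained to have similar
    feature vectors F."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(F.T @ L @ F)
```

In a training loop, this scalar would be added to the classification objective so that feature extraction is guided by the modality graph rather than performed independently of it.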