High-resolution remote sensing images have significant applications in many fields, such as the military, agriculture, and mining. With the maturing technology for acquiring remote sensing images, abundant remote sensing image datasets can provide reliable support for object detection, tracking, and understanding. Semantic understanding of remote sensing image content and of the change information it contains is a hot research topic, and the tasks of remote sensing image captioning and remote sensing image change captioning have emerged from these application requirements. Owing to the difficulty of extracting features unique to remote sensing images and the lack of semantic prior knowledge, the gap between the two modalities cannot be effectively narrowed. Therefore, how to design an effective feature extraction module and how to introduce semantic information are the key technical points in remote sensing image captioning and remote sensing image change captioning. For the remote sensing image captioning task, existing models suffer from insufficient accuracy and lack rich feature extraction and semantic guidance. To improve performance, interpretability, and transferability, this work fully considers the unique visual features of remote sensing images and effective interaction between visual and semantic information. For the task of describing changes in temporal remote sensing images, we are committed to exploring robust change-feature localization and extraction to address the limitations of existing change-feature extraction, providing rich and faithful change information for the decoder and further advancing this task. The main content and related innovative contributions are introduced as follows:

(1) In deep-learning-based remote sensing image captioning, visual feature extraction often neglects the characteristics of high-resolution remote sensing images, and interaction between image content and semantic vectors is limited. Thus, a novel recurrent attention and semantic gate framework is proposed. To fully account for the overhead (bird's-eye) perspective and the varying scales of remote sensing images, a multi-scale feature extraction module based on dilated convolution is constructed to express visual information (a minimal sketch follows this paragraph). A recurrent attention mechanism then models both visual and non-visual features in the decoder simultaneously, further improving the decoder's ability to recognize and focus on the effective information at the current time step. Finally, a semantic gate enhances the understanding and inference of implicit semantic feature vectors, aiming to sufficiently understand complex content and alleviate the problem of category confusion. The experimental results indicate that the effective feature extraction and the cross-modal interaction module markedly improve the performance of remote sensing image captioning.
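The multi-scale extractor in (1) can be pictured as parallel dilated-convolution branches over a backbone feature map. The following PyTorch sketch is illustrative only: the branch count, the dilation rates (1, 2, 4, 8), and the channel widths are assumptions, not the exact configuration of the proposed module.

```python
# Minimal sketch of a multi-scale feature extractor built on dilated
# convolutions. All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class MultiScaleDilatedExtractor(nn.Module):
    """Fuses parallel dilated-convolution branches over CNN feature maps."""

    def __init__(self, in_channels: int = 2048, out_channels: int = 512,
                 dilations=(1, 2, 4, 8)):
        super().__init__()
        # One 3x3 branch per dilation rate; padding = dilation keeps the
        # spatial size unchanged while enlarging the receptive field.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # 1x1 fusion of the concatenated multi-scale responses.
        self.fuse = nn.Conv2d(out_channels * len(dilations), out_channels,
                              kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) backbone feature map of a remote sensing image.
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(multi_scale)


if __name__ == "__main__":
    feats = torch.randn(2, 2048, 14, 14)   # e.g., a ResNet stage-5 output
    module = MultiScaleDilatedExtractor()
    print(module(feats).shape)             # torch.Size([2, 512, 14, 14])
```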
(2) In attention-driven remote sensing image captioning methods, visual features guide the model directly and explicitly, which raises two problems: on the one hand, static features neither model the relationships between targets nor filter out redundant scene information; on the other hand, attention mechanisms are insufficient to capture the effective information in static visual features. Therefore, a remote sensing image captioning algorithm based on feature enhancement and cyclic attention is proposed. To capture the relational information among the targets contained in remote sensing images, a feature enhancement module strengthens the visual regions rich in information. In addition, to alleviate the issue of potentially invalid attended-region information, an adaptive cyclic attention mechanism is proposed (see the first sketch after contribution (4)). This mechanism performs region selection over one or more attention steps, terminating on a confidence threshold or a maximum number of attention steps, and attempts to match the most effective guidance information to each inference stage of the decoder. The adaptive attention module proposed in this chapter can be applied directly to existing cross-modal tasks and improves the sentence quality of remote sensing image captioning models without increasing inference cost. The experimental results on three benchmark datasets demonstrate that the model achieves consistent performance improvements over other advanced algorithms.

(3) For encoder-decoder-based remote sensing image captioning methods, attribute-driven approaches introduce additional semantic label information obtained from pretrained multi-label networks. However, the uncontrolled generation of misleading labels degrades model performance. Thus, a novel visual-semantic interaction framework for remote sensing image captioning is proposed. A high-level mapping from remote sensing image content to semantic features is constructed, and a trainable semantic concept extractor obtains the semantic concepts of remote sensing images. A visual-semantic co-attention module learns coarse-grained semantic-related regions (visual context vectors) and region-related semantics (semantic context vectors), achieving multi-modal interaction and alignment (see the second sketch after contribution (4)). Finally, the visual context, semantic context, and semantic relational features are fed into a consensus exploitation module based on a graph convolutional network, so as to better realize visual-semantic consensus awareness and deepen semantic understanding across the modalities. End-to-end training is used to optimize the model parameters. The experimental results show that multi-modal interaction is highly beneficial when constructing remote sensing image captioning models: the framework reaches state-of-the-art performance compared with current algorithms, and the embedding of high-level semantic information makes an important contribution.

(4) The dominant remote sensing image captioning methods are mostly based on the attention mechanism and lack target-level visual features to guide the model; moreover, the limitations of a Long Short-Term Memory decoder become unavoidable as the input sequence grows longer. To address these two problems, a patch-level salient-aware and multi-label-reinforced method for remote sensing image captioning is proposed. Considering that target-level information exists in the patch blocks of a Transformer encoder, a target-level salient region module is designed for better visual perception of targets, and a trainable multi-label classifier exploits implicit semantic knowledge as a complement to the object-level features. Since modeling the relationships between cross-modal features is crucial, the cross-modal association attention module contains two parallel multi-head attention branches: one for salient region features and the other for embedded semantic features (see the third sketch below). Extensive experimental results show the superiority of the designed algorithm on three public datasets.
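The adaptive cyclic attention in (2) can be pictured as an attention loop with early termination. In this hedged PyTorch sketch, the confidence head, the 0.9 threshold, and the three-step cap are all assumptions standing in for the thesis's actual termination criterion.

```python
# Minimal sketch of adaptive cyclic attention: re-attend to the visual
# regions until a confidence score passes a threshold or a step limit
# is hit. Confidence head, threshold, and max_steps are assumptions.
import torch
import torch.nn as nn


class AdaptiveCyclicAttention(nn.Module):
    def __init__(self, dim: int = 512, max_steps: int = 3,
                 threshold: float = 0.9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.confidence = nn.Linear(dim, 1)  # hypothetical confidence head
        self.max_steps = max_steps
        self.threshold = threshold

    def forward(self, query: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # query: (B, 1, D) decoder state; regions: (B, N, D) visual regions.
        context = query
        for _ in range(self.max_steps):
            context, _ = self.attn(context, regions, regions)
            # Stop early once the attended context looks confident enough.
            conf = torch.sigmoid(self.confidence(context)).mean()
            if conf.item() > self.threshold:
                break
        return context


if __name__ == "__main__":
    q, r = torch.randn(2, 1, 512), torch.randn(2, 49, 512)
    print(AdaptiveCyclicAttention()(q, r).shape)  # torch.Size([2, 1, 512])
```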
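The visual-semantic co-attention in (3) can be approximated by one affinity matrix between region features and concept embeddings, from which both context vectors are derived. The shared projection and the mean pooling below are assumptions, and the consensus exploitation GCN is omitted.

```python
# Minimal sketch of visual-semantic co-attention: an affinity matrix A
# between regions and semantic concepts yields a visual context vector
# (semantic-related regions) and a semantic context vector
# (region-related semantics). Dimensions and pooling are assumptions.
import torch
import torch.nn as nn


class VisualSemanticCoAttention(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)  # shared affinity projection

    def forward(self, vis: torch.Tensor, sem: torch.Tensor):
        # vis: (B, N, D) region features; sem: (B, K, D) semantic concepts.
        # Affinity between every region and every concept: (B, N, K).
        A = torch.matmul(self.W(vis), sem.transpose(1, 2))
        # Attend over regions per concept -> visual context vector (B, D).
        vis_ctx = torch.matmul(A.softmax(dim=1).transpose(1, 2), vis).mean(dim=1)
        # Attend over concepts per region -> semantic context vector (B, D).
        sem_ctx = torch.matmul(A.softmax(dim=2), sem).mean(dim=1)
        return vis_ctx, sem_ctx
```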
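The cross-modal association attention in (4) runs two multi-head attention branches in parallel. In this sketch the head count and the concatenate-then-project fusion are assumptions.

```python
# Minimal sketch of cross-modal association attention: the decoder query
# attends to salient region features and to embedded semantic features in
# two parallel branches, whose contexts are then fused.
import torch
import torch.nn as nn


class CrossModalAssociationAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.visual_branch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.semantic_branch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # assumed fusion rule

    def forward(self, query: torch.Tensor, salient_regions: torch.Tensor,
                semantic_embeddings: torch.Tensor) -> torch.Tensor:
        # query: (B, T, D) decoder states;
        # salient_regions: (B, N, D); semantic_embeddings: (B, K, D).
        v_ctx, _ = self.visual_branch(query, salient_regions, salient_regions)
        s_ctx, _ = self.semantic_branch(query, semantic_embeddings,
                                        semantic_embeddings)
        return self.fuse(torch.cat([v_ctx, s_ctx], dim=-1))
```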
(5) Temporal remote sensing images acquired over the same area contain complex surface changes. To help people understand change information flexibly, remote sensing image change captioning has become a special and popular task, yet little research exists on it so far. The main challenge lies in learning trustworthy change features and bridging the vision-text gap. Assuming that change features should be symmetric, this chapter proposes a symmetric intertemporal network. Specifically, a temporal cross-attention mechanism realizes interaction between the original temporal features, stimulating the internal feature representation, coupling the difference information, and suppressing irrelevant interference. Second, a symmetric difference transformation module produces symmetric change features, namely "before-to-after" and "after-to-before" change features (a minimal sketch follows this paragraph). A dedicated loss is adopted to learn strongly discriminative change features, effectively alleviating the inaccurate change regions and poor stability caused by unidirectional change features. The experimental results on the Dubai-CC and LEVIR-CC datasets show that the proposed framework achieves excellent performance improvements.
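The symmetric difference transformation in (5) can be pictured as bidirectional temporal cross-attention followed by differencing in both directions. The mirror-consistency loss below is an assumption standing in for the discriminative change-feature loss described above.

```python
# Minimal sketch of the symmetric intertemporal idea: each epoch's
# features attend to the other epoch, and change features are formed in
# both "before-to-after" and "after-to-before" directions. The
# consistency loss shown here is an assumption, not the thesis's loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SymmetricIntertemporal(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Shared cross-attention keeps the two directions symmetric.
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, before: torch.Tensor, after: torch.Tensor):
        # before, after: (B, N, D) patch features of the two temporal images.
        b2a, _ = self.cross(before, after, after)    # before attends to after
        a2b, _ = self.cross(after, before, before)   # after attends to before
        diff_fwd = b2a - before                      # "before-to-after" change
        diff_bwd = a2b - after                       # "after-to-before" change
        # Encourage the two directional change features to mirror each other.
        sym_loss = F.mse_loss(diff_fwd, -diff_bwd)
        return diff_fwd, diff_bwd, sym_loss


if __name__ == "__main__":
    t0, t1 = torch.randn(2, 49, 512), torch.randn(2, 49, 512)
    fwd, bwd, loss = SymmetricIntertemporal()(t0, t1)
    print(fwd.shape, bwd.shape, loss.item())
```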