Remote sensing image captioning aims to enable a computer to distinguish and comprehend the content of an image and automatically generate a corresponding description sentence, combining the two fields of computer vision and natural language processing. It plays a key role in many application scenarios of remote sensing technology, such as military intelligence generation, information retrieval, resource investigation, and disaster detection. Unlike image understanding tasks such as recognition and object detection, image captioning must not only identify the objects and attributes in an image, but also establish the relationships between them and generate natural language descriptions that conform to human norms. Benefiting from the vigorous development of artificial intelligence, the feature extraction ability of deep neural networks has greatly improved the quality of generated descriptions. However, remote sensing images suffer from large-scene imaging, complex and diverse backgrounds, multi-scale and rotation characteristics, and semantic ambiguity, which further increase the difficulty of image captioning.

In this thesis, a remote sensing image captioning model based on multi-level attention and visual adaptation, MLVA-Net, is proposed within the encoder-decoder framework to address the difficulty of semantic understanding and the multi-scale nature of remote sensing images. The main work includes:

(1) To address the multi-scale and category ambiguity of remote sensing images, the thesis employs a multi-level attention module in the encoder to refine the visual features extracted by the CNN and obtain more abstract deep image features. Spatial and channel attention mechanisms learn features at specific locations and at different scales of the image, improving the performance of the model.

(2) The loss of visual information in the convolutional layers during forward propagation makes it difficult for the network to learn the complete semantic information of the image. The thesis therefore designs a contextual attention module in the encoder that fuses multi-level features, integrating the low-level and high-level features of the CNN. This achieves semantic complementation between local and global features and increases the diversity of the generated descriptions.

(3) To address the semantic ambiguity between the visual features of remote sensing images and textual attribute information, the thesis proposes a visually adaptive LSTM decoder. It employs a visual sentinel mechanism to adaptively select between visual information and contextual information when generating each word, producing more discriminative and more accurate description sentences.

Finally, the thesis verifies the effectiveness of the proposed MLVA-Net both quantitatively and qualitatively through ablation experiments, comparative experiments, and visualization. Five commonly used captioning metrics are used to evaluate MLVA-Net on four datasets: UCM-Captions, Sydney-Captions, RSICD, and NWPU-Captions. The experimental results demonstrate that MLVA-Net has strong robustness and generalization, and that it can generate more discriminative descriptions for remote sensing images with complex backgrounds. In addition, the multi-level attention increases the attention paid to smaller regions, and the visual sentinel achieves semantic alignment between image and text, yielding more accurate and richer descriptions of remote sensing images.
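The spatial and channel attention used to refine CNN features in the encoder can be sketched as follows. This is a minimal NumPy illustration of the idea only: real modules of this kind (e.g. CBAM-style blocks) use learned projection layers and convolutions, which are replaced here by simple pooling and a fixed sigmoid gate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Reweight each channel by a gate derived from its global average.

    feat: (C, H, W) feature map from a CNN layer.
    """
    gate = sigmoid(feat.mean(axis=(1, 2)))     # (C,) one weight per channel
    return feat * gate[:, None, None]

def spatial_attention(feat):
    """Reweight each spatial location by a gate from the channel-pooled map."""
    gate = sigmoid(feat.mean(axis=0))          # (H, W) one weight per location
    return feat * gate[None, :, :]

# Channel attention first, then spatial attention, as in CBAM-style designs.
feat = np.random.rand(256, 7, 7)               # illustrative shape
refined = spatial_attention(channel_attention(feat))
```

Because both gates lie in (0, 1), each step suppresses uninformative channels and locations while preserving the feature-map shape, so the refined features can be passed on to the decoder unchanged in layout.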
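The integration of low-level and high-level CNN features by the contextual attention module can be sketched in its simplest form as upsample-and-concatenate. This NumPy sketch shows only that fusion step under assumed shapes; the thesis's contextual attention additionally learns how to weight the two levels, which is omitted here.

```python
import numpy as np

def upsample_nn(feat, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_levels(low, high):
    """Concatenate a low-level map with an upsampled high-level map.

    low:  (C1, H, W)       early-layer features (fine spatial detail)
    high: (C2, H//f, W//f) deep-layer features (coarse semantics)
    """
    high_up = upsample_nn(high, low.shape[1] // high.shape[1])
    return np.concatenate([low, high_up], axis=0)  # (C1 + C2, H, W)

low = np.random.rand(64, 14, 14)    # illustrative shapes
high = np.random.rand(256, 7, 7)
fused = fuse_levels(low, high)      # channels stacked at the finer resolution
```

The fused map keeps the finer spatial resolution of the low-level features while carrying the semantics of the deep layer, which is what allows local and global information to complement each other.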
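The visual sentinel mechanism in the decoder can be sketched as an attention step that scores the sentinel alongside the visual regions, so the softmax itself decides how much to rely on visual versus contextual information. This is a schematic NumPy sketch in the spirit of adaptive attention: dot-product scoring stands in for the learned attention network, and all shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_context(V, h, s):
    """Blend attended visual features with a visual sentinel.

    V: (k, d) visual region features from the encoder
    h: (d,)   current decoder hidden state
    s: (d,)   visual sentinel summarizing the language context
    Returns the context vector and the sentinel gate beta.
    """
    # Score every region and the sentinel against the hidden state.
    scores = np.append(V @ h, s @ h)   # (k + 1,)
    alpha = softmax(scores)
    beta = alpha[-1]                   # weight given to the sentinel, in (0, 1)
    # Extended-softmax mixture: attended visual features plus the gated sentinel.
    ctx = alpha[:-1] @ V + beta * s    # (d,)
    return ctx, beta

V = np.random.rand(49, 512)            # illustrative: 7x7 regions, 512-d features
h = np.random.rand(512)
s = np.random.rand(512)
ctx, beta = adaptive_context(V, h, s)
```

A beta near 1 means the next word is generated mostly from the language context (e.g. function words), while a beta near 0 means the decoder is grounding the word in the attended image regions; this is the adaptive selection that aligns image and text semantics.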