Image captioning is a multimodal task at the intersection of computer vision and natural language processing. It aims to make computers automatically generate accurate captions that describe image content, and it has broad application prospects in remote sensing, visual assistance, human-computer interaction, and other fields. Current deep-learning-based image captioning models generally adopt an encoder-decoder architecture with attention mechanisms. However, existing attention mechanisms cannot effectively identify the important regions and important visual features in an image, so models sometimes pay excessive attention to unimportant regions and features while generating captions, which leads to coarse-grained or even incorrect captions. At the same time, existing visual encoders have only undergone visual pre-training, which leaves a significant semantic gap between the extracted visual features and text features and hinders the generation of high-quality captions. To address these problems in image captioning, this paper completes the following work:

(1) This paper proposes Multi-Level Discrimination Attention (MLDA) and an image captioning model, MLDANet, built on it. Compared with other mainstream attention mechanisms, MLDA can recognize the important regions and features in an image and guide the model to focus on them while generating a caption, reducing the chance of being misled by unimportant regions and features. MLDANet adopts a CNN+LSTM+attention structure and applies the attention mechanism in both the encoder and the decoder. Experimental results on the MSCOCO dataset support two conclusions: compared with other mainstream attention mechanisms, MLDA more effectively improves the quality of the generated captions, and MLDANet generates high-quality image captions.

(2) To reduce the semantic gap between the visual features extracted by the encoder and the language features, and to speed up visual feature extraction and caption generation, this paper proposes a Transformer-based language decoder, MLDA-Decoder, and an image captioning model, ViLTCap, built on the multimodal pre-trained model ViLT (Vision and Language Transformer) and Multi-Level Discrimination Attention. Compared with a CNN encoder, the ViLT encoder extracts visual features faster and of higher quality; compared with the standard Transformer decoder, MLDA-Decoder has lower time complexity. Finally, experimental results on the MSCOCO dataset demonstrate that the ViLTCap framework is advantageous and generates high-quality image captions at a faster speed.
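As context for contribution (1), the sketch below illustrates the generic CNN+LSTM+soft-attention decoding step that the described MLDANet architecture builds on. It is an illustration only: the multi-level discrimination scoring that distinguishes MLDA from ordinary attention is not detailed in this abstract, and the class name, dimensions, and PyTorch-style interface are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveCaptionDecoder(nn.Module):
    """Minimal soft-attention LSTM decoding step over CNN region features.

    A generic sketch of the CNN+LSTM+attention pattern; MLDA's multi-level
    discrimination of important regions/features is not reproduced here.
    """

    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)    # project region features
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)    # project decoder state
        self.att_score = nn.Linear(hidden_dim, 1)           # scalar score per region
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, prev_word, state):
        # regions:   (batch, num_regions, feat_dim) CNN features
        # prev_word: (batch,) previous token ids
        # state:     (h, c), each (batch, hidden_dim)
        h, c = state
        scores = self.att_score(torch.tanh(
            self.att_feat(regions) + self.att_hid(h).unsqueeze(1)))  # (batch, R, 1)
        alpha = F.softmax(scores, dim=1)            # attention weights over regions
        context = (alpha * regions).sum(dim=1)      # weighted visual context vector
        x = torch.cat([self.embed(prev_word), context], dim=-1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), (h, c)                  # vocabulary logits, new state
```

In this baseline form the attention weights depend only on the decoder state and raw region features; the abstract's claim is that MLDA additionally discriminates important from unimportant regions and features at multiple levels, which this sketch does not attempt to model.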