Image captioning is the task of understanding an image and generating a corresponding textual description by computer. It is an important part of cross-modal understanding, plays an important role in many fields such as security and anti-terrorism, autonomous driving, and public opinion monitoring, and has great theoretical significance and application value. At present, the field of image captioning still faces several problems, such as the lack of higher-order relations, the low accuracy of descriptions, and unreasonable description content. Building on the Transformer, this paper studies the generation and introduction mechanisms of attention and visual common sense and, to address these problems, proposes three image captioning models. The main research and work of this paper are summarized as follows:

(1) An image captioning model that fuses multi-level features and dual attention is proposed. The model jointly enhances multi-level features and dual attention to model the higher-order relationships of visual information and extract rich semantic features. The feature enhancement module learns the context information hidden in non-local space and at different scales to obtain more detailed information. The dual attention module performs interactions over all elements to fuse channel and spatial attention, realizing higher-order cross-modal interaction. The experimental results show that the model reaches 80.6%, 38.8%, 29.0%, 58.7%, 128.6% and 22.7% on the BLEU-1, BLEU-4, METEOR, ROUGE, CIDEr and SPICE metrics, respectively, which is higher than the Transformer baseline model.

(2) An image captioning model enhanced by mesh cross-attention is proposed. The model establishes a net-like connection between the visual encoder and the language decoder by constructing a net-like cross-attention module based on the cross-attention mechanism. The designed net-like cross-attention effectively broadens the interaction range between the visual encoder and the language decoder, and can simultaneously attend to visual semantics and object attributes to generate more accurate image descriptions. Compared with the representative POS-SCAN model, the BLEU-1, BLEU-4, METEOR, ROUGE, CIDEr and SPICE metrics of the proposed model are improved by 0.4%, 1.3%, 0.6%, 3.3%, 4.5% and 0.7%, respectively. In particular, the improvement in CIDEr indicates that the sentences generated by the proposed model are more likely to share synonyms or original words with the annotated reference sentences, which validates that the new model can effectively improve the accuracy of image captioning.
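To make the dual-attention idea in (1) concrete, the following is a minimal, illustrative sketch of fusing spatial and channel attention over region features. The class name, shapes, and additive fusion are assumptions for illustration only, not the thesis implementation.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Illustrative fusion of spatial and channel attention (hypothetical).

    Input x has shape (batch, n_regions, dim); names and the additive
    fusion below are assumptions, not the thesis code.
    """

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Spatial attention: every region attends to every other region,
        # capturing non-local context among image regions.
        spatial = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1) @ v

        # Channel attention: the same interaction computed between feature
        # channels instead of regions (swap the region/channel axes).
        xc = x.transpose(1, 2)                       # (b, d, n)
        chan = torch.softmax(xc @ xc.transpose(1, 2) / n ** 0.5, dim=-1) @ xc
        channel = chan.transpose(1, 2)               # back to (b, n, d)

        # Fuse the two attention branches and project.
        return self.out(spatial + channel)
```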
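Similarly, a minimal sketch of the net-like (meshed) cross-attention in (2): each decoder layer attends to the outputs of all encoder layers, and the per-layer contexts are merged with learned gates. The gating scheme and averaging are assumptions rather than the thesis code.

```python
import torch
import torch.nn as nn

class MeshedCrossAttention(nn.Module):
    """Cross-attention from a decoder state to all encoder layers (sketch)."""

    def __init__(self, dim, n_heads, n_enc_layers):
        super().__init__()
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, n_heads, batch_first=True)
             for _ in range(n_enc_layers)]
        )
        # One gate per encoder layer, computed from query and context.
        self.gate = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(n_enc_layers)]
        )

    def forward(self, query, enc_outputs):
        # enc_outputs: list of (batch, n_regions, dim), one per encoder layer.
        merged = 0.0
        for attn, gate, mem in zip(self.attn, self.gate, enc_outputs):
            ctx, _ = attn(query, mem, mem)
            # Gate each layer's context before merging, so the decoder can
            # weight low-level attributes against high-level semantics.
            g = torch.sigmoid(gate(torch.cat([query, ctx], dim=-1)))
            merged = merged + g * ctx
        return merged / len(enc_outputs)
```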
(3) An image captioning model guided by attention and visual common sense is proposed. The model consists of a confounder dictionary construction module and a visual common sense guidance module. Introducing reliable causal relationships constrains the rationality of the description content and improves the quality of the image descriptions. The confounder dictionary construction module extracts the interference items that cause prediction bias due to uneven and limited data distributions. The visual common sense guidance module obtains reliable causal relationships by eliminating the confounders, which yields a more accurate feature representation of the relationships between regions. The experimental results show that the proposed model achieves 80.9%, 39.4%, 29.4%, 59.1%, 130.8% and 23.3% on BLEU-1, BLEU-4, METEOR, ROUGE, CIDEr and SPICE, respectively. This shows that the model can accurately measure the relationships between triples and weaken noise interference, which makes the descriptions more reasonable and improves the quality of image captioning.
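Finally, a minimal sketch of one standard way to use a confounder dictionary as in (3): a backdoor-style adjustment that attends from region features to a precomputed dictionary of confounder prototypes weighted by their prior. The dictionary contents, the prior, and the residual fusion are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class CausalIntervention(nn.Module):
    """Soft backdoor adjustment over a confounder dictionary (sketch).

    Approximates P(y | do(x)) = sum_z P(y | x, z) P(z): the dictionary z
    (one prototype feature per confounder) and its prior p(z) are assumed
    to be precomputed from the training set.
    """

    def __init__(self, dim, confounders, prior):
        super().__init__()
        self.z = nn.Parameter(confounders, requires_grad=False)   # (K, dim)
        self.prior = nn.Parameter(prior, requires_grad=False)     # (K,)
        self.query = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, n_regions, dim) region features.
        scores = self.query(x) @ self.z.t() / x.size(-1) ** 0.5   # (b, n, K)
        # Weight each confounder by both its relevance and its prior p(z),
        # so rare but relevant confounders are not drowned out by the data bias.
        weights = torch.softmax(scores, dim=-1) * self.prior
        weights = weights / weights.sum(dim=-1, keepdim=True)
        context = weights @ self.z                                 # (b, n, dim)
        # Deconfounded feature: original evidence plus expected confounder.
        return x + context
```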