Semantic description of images is the task of having a computer generate a natural-language description of an image that conforms to the logic of human language. It is a cross-disciplinary field spanning computer vision and natural language processing, with important applications in assisting visually impaired people, in intelligent driving, and in the understanding of traditional ethnic costumes. Describing the content of an image in language is a basic human ability, but it poses many difficulties for a computer. As a multimodal learning problem, image semantic description therefore requires accurate recognition of the properties of, and relationships between, the objects in an image, as well as syntactic accuracy and richness in the generated sentences. Researchers have already achieved some results in this area, but problems remain: image features are extracted insufficiently, the generated sentences can mismatch the main content of the image, and the descriptions lack emotional colour. To address these problems, this paper optimises and improves an existing deep learning model, as follows:

1. A ViT-based image feature extraction model is proposed to strengthen the network's ability to extract image features. Local features are extracted with a ResNeXt-101 network and global features with a ViT network; the global features are combined with the local features to form the visual features of the image, which are then fed into the decoder to generate the description sentence (a sketch of such a fusion encoder is given below). Simulation results show that the model, built on the ResNeXt-101 network combined with the ViT network, fully extracts the visual features of the image and lays the foundation for its semantic description.

2. To improve the accuracy and richness of the description sentences generated by the model, an image description algorithm incorporating a channel attention mechanism is proposed. The visual features extracted by the ViT and ResNeXt-101 networks are refined by the channel attention mechanism, which assigns larger weights to the salient regions of the image and smaller weights to the insignificant ones, so that the features extracted by the model are more accurate (see the attention sketch below). The attended visual features are then combined with the text features and fed into the image decoding module to generate the description sentences. Simulation results show that the sentences generated by this method match the main content of the image, making the network's descriptions more accurate and richer.

3. To address the lack of emotional colour in image description sentences, a generative adversarial network (GAN)-based image semantic description method is proposed, consisting of a generator model and a discriminator model. First, the input image is passed through the generator to produce a description sentence; the discriminator then judges whether that sentence is real or generated, and its judgement is fed back to the generator through a reward mechanism (sketched below). Second, an emotion corpus is added to the generator so that the generated description sentences carry emotional colour. Simulation results show that the sentences generated by this method are more vivid and emotionally rich.
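To make the feature fusion of contribution 1 concrete, here is a minimal PyTorch sketch of one way such an encoder could be built. It assumes the torchvision implementations of ResNeXt-101 and ViT-B/16; the class name FusionEncoder, the projection width d_model, and the token layout are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnext101_32x8d, vit_b_16

class FusionEncoder(nn.Module):
    """Sketch: fuse ResNeXt-101 local (grid) features with a ViT global feature."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        cnn = resnext101_32x8d(weights="DEFAULT")
        # Drop the average pool and classifier to keep the (B, 2048, 7, 7) grid.
        self.local = nn.Sequential(*list(cnn.children())[:-2])
        self.vit = vit_b_16(weights="DEFAULT")
        self.vit.heads = nn.Identity()             # expose the 768-d class-token embedding
        self.proj_local = nn.Linear(2048, d_model)
        self.proj_global = nn.Linear(768, d_model)

    def forward(self, images: torch.Tensor) -> torch.Tensor:       # (B, 3, 224, 224)
        grid = self.local(images)                                  # (B, 2048, 7, 7)
        local = self.proj_local(grid.flatten(2).transpose(1, 2))   # (B, 49, d_model)
        glob = self.proj_global(self.vit(images)).unsqueeze(1)     # (B, 1, d_model)
        # One global token followed by 49 region tokens, for a decoder
        # to attend over with cross-attention.
        return torch.cat([glob, local], dim=1)                     # (B, 50, d_model)
```

Concatenating the global token with the region tokens is only one fusion choice; element-wise addition or gating would fit the abstract's description equally well.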
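The channel attention of contribution 2 is described only at a high level. One standard way to realise it is a squeeze-and-excitation block, sketched below; the channel count and reduction ratio are assumed values, and the thesis may use a different formulation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of SE-style channel attention: informative channels are scaled
    toward 1, uninformative ones toward 0."""
    def __init__(self, channels: int = 2048, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # squeeze: global average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                         # excitation: weight in (0, 1) per channel
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:    # feat: (B, C, H, W)
        w = self.fc(self.pool(feat).flatten(1))               # (B, C) channel weights
        return feat * w.view(w.size(0), w.size(1), 1, 1)      # reweighted feature map
```

Because the weights act on whole channels, salience is expressed through the feature channels rather than explicit spatial positions; applied before fusion, this lets the salient content dominate the visual features passed to the decoder.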
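The reward feedback of contribution 3 resembles a SeqGAN-style policy-gradient update. The sketch below is a guess at that loop: generator.sample and the discriminator call are hypothetical interfaces not specified in the abstract, and the emotion-corpus conditioning is omitted.

```python
import torch

def adversarial_caption_step(generator, discriminator, images, optimizer):
    """One REINFORCE-style update (sketch): sample a caption, score it with
    the discriminator, and use that score as the reward for the generator."""
    # Hypothetical API: returns sampled token ids (B, T) and their log-probs (B, T).
    tokens, log_probs = generator.sample(images)
    with torch.no_grad():
        # Hypothetical API: probability in (0, 1) that the caption is human-written.
        reward = discriminator(images, tokens)                 # (B,)
    loss = -(log_probs.sum(dim=1) * reward).mean()             # policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice a baseline (for example, the reward of a greedily decoded caption) is usually subtracted from the reward to reduce gradient variance.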