Image description is one of the most challenging research topics in artificial intelligence. Its goal is to enable a computer to recognize and understand the content of an image and automatically generate a corresponding textual description. Humans can readily understand and describe the information contained in an image, and endowing computers with this ability is of great practical significance. Image description also has a wide range of real-world applications, such as intelligent human-computer interaction, early childhood education, information retrieval, and assistance for the visually impaired. In recent years, image description technology based on deep learning has developed rapidly; in particular, the use of neural networks has greatly improved the performance of image description models. By analyzing deep-learning-based image description techniques and drawing on advances in neural machine translation, this paper proposes an image description model that differs from current mainstream methods in order to address inaccurate descriptions in complex scenes. The main contributions are as follows:

(1) Visual feature information is lost as it propagates through the convolutional layers, so a model may fail to fully capture the semantics of the input image. To extract the semantic information contained in an image more completely, this paper proposes a multi-model cross-layer feature merging method that fuses low-level and high-level features and trains multiple encoders to extract features. Semantic features and detail features thus complement each other, allowing the model to learn more vivid and specific description sentences.

(2) Natural scene images often contain multiple targets and complex background information, and the corresponding descriptions are usually long sentences with complex structures. Current mainstream methods (based on RNNs or LSTMs) are only moderately effective at extracting the semantic information of long sentences: they tend to ignore the basic hierarchical structure of sentences and learn long word sequences poorly. To address this, this paper designs a causal convolutional neural network to extract text features and learn long word sequences effectively. Experimental results show that the model improves the ability to describe images containing complex scene information.

(3) Motivated by the performance limitations of a single attention model in capturing information and by the idea of multi-layer attention, this paper proposes a CNN language model that incorporates multi-layer attention for image description. Multiple attention models in the language module guide the causal convolutional layers as they process text, so that the model obtains additional visual information at each convolution over the text at every time step.

In addition, we validate the effectiveness of the proposed method from both quantitative and qualitative perspectives, including feature-map visualization and ablation experiments. The model is evaluated with several metrics on the MSCOCO and Flickr30k datasets. The experimental results show that the proposed model performs well: it effectively extracts and preserves semantic information in images with complex backgrounds, handles long word sequences, and, through the multi-layer attention module, attends more closely to small-region features, yielding more accurate descriptions of image content and richer expression.
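The abstract does not give implementation details for the cross-layer feature fusion in contribution (1). A minimal sketch of the general idea, assuming hypothetical shapes and a simple pool-then-concatenate scheme (the function name `fuse_cross_layer` and all dimensions are illustrative, not the thesis's actual architecture):

```python
import numpy as np

def fuse_cross_layer(low_feat, high_feat):
    """Cross-layer fusion sketch: average-pool a low-level (detail)
    feature map down to the high-level (semantic) map's spatial size,
    then concatenate along channels so the two complement each other.

    low_feat:  (C1, H1, W1) early-layer feature map (hypothetical shapes)
    high_feat: (C2, H2, W2) late-layer feature map, with H1 % H2 == 0
    """
    c1, h1, w1 = low_feat.shape
    c2, h2, w2 = high_feat.shape
    sh, sw = h1 // h2, w1 // w2
    # Average-pool the low-level map to (C1, H2, W2).
    pooled = (low_feat[:, :h2 * sh, :w2 * sw]
              .reshape(c1, h2, sh, w2, sw)
              .mean(axis=(2, 4)))
    # Channel-wise concatenation: (C1 + C2, H2, W2).
    return np.concatenate([pooled, high_feat], axis=0)
```

A real encoder would learn the fusion weights (e.g. via 1x1 convolutions) rather than plain concatenation; this only shows how detail and semantic maps can be aligned and merged.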
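For contribution (2), the key property of a causal convolution is that the output at time step t depends only on inputs at steps <= t, so the language model never sees future words. A self-contained sketch under assumed shapes (a single scalar-output filter; the thesis's network would stack many such layers with learned kernels):

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution over a word-embedding sequence.

    x:      (T, d) sequence of d-dimensional embeddings (hypothetical)
    kernel: (k, d) filter; returns a (T,) feature, one value per step.
    Left zero-padding guarantees output[t] uses only x[t-k+1 .. t].
    """
    T, d = x.shape
    k = kernel.shape[0]
    padded = np.vstack([np.zeros((k - 1, d)), x])  # pad the past only
    out = np.empty(T)
    for t in range(T):
        out[t] = np.sum(padded[t:t + k] * kernel)
    return out
```

The causality can be checked directly: perturbing a future token must leave all earlier outputs unchanged.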
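Contribution (3) describes attention guiding the convolutional language model at every time step. A rough sketch of stacked (multi-layer) visual attention, where a text-side query repeatedly attends over image-region features; all shapes and function names here are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def visual_attention(query, regions):
    """One attention layer: a text query attends over image regions.

    query:   (d,)   current text feature (hypothetical shape)
    regions: (n, d) n image-region feature vectors
    Returns the attended visual context and the attention weights.
    """
    scores = regions @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)          # (n,) distribution over regions
    return weights @ regions, weights  # (d,) weighted visual context

def stacked_attention(query, regions, layers=2):
    """Multi-layer attention: each layer refines the query with the
    attended context before attending again, so later layers can focus
    on smaller, more specific regions."""
    ctx, w = visual_attention(query, regions)
    for _ in range(layers - 1):
        query = query + ctx            # refine query with visual context
        ctx, w = visual_attention(query, regions)
    return ctx, w
```

In the full model, one such context vector would be injected into each causal convolution step so the decoder receives fresh visual information at every time step.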