With the development of technology, more and more things that were once unimaginable have become reality. In the era of big data, deep convolutional neural networks contain more hidden layers and therefore have more complex structures than earlier neural networks. Compared with traditional machine learning methods, they have much stronger feature learning and representation capabilities, which has made convolutional neural networks widely used in computer vision, where they have achieved impressive results on recent tasks. This thesis focuses on image captioning, a popular topic in computer vision that combines natural language processing and computer vision. Its purpose is to train a model that can generate a natural language description of an input image. Images are ubiquitous in daily life, and humans understand their content with ease. As science and technology advance and deep learning progresses, researchers continue to explore how to enable machines to perceive the world through images: to recognize what an image contains, to describe it, and thereby to understand the surrounding world. Such research is clearly meaningful for the development of artificial intelligence and for humanity's future. Many application scenarios are already being explored, such as assistance for the blind, automatic detection of drowning and fire alarms, image retrieval, and robots that can communicate with humans, all of which rely on image captioning.

This thesis addresses the shortcomings of current image captioning methods, including inaccurate recognition of objects in images, long model computation time, and incoherent output sentences that lack contextual semantics. Several improvements are made to the encoder and decoder algorithms, and experiments verify that the proposed method effectively improves the model's execution speed and the accuracy of the generated text. The main contributions of this thesis are as follows:

(1) A Yolov5-based image captioning framework is established for image feature extraction. Yolov5 has achieved impressive results in object detection and recognition; this thesis applies its strong feature extraction and image encoding capabilities to image captioning, obtaining faster, more detailed, and higher-quality image encodings than previous models. Experiments show that it improves the model's computational speed and evaluation scores to some extent.

(2) A composite information interaction BiLSTM algorithm incorporating an attention mechanism is proposed. Long short-term memory (LSTM) recurrent neural networks can analyze input as a time series, but current algorithms still produce incoherent text predictions. This thesis therefore introduces a bidirectional LSTM as the caption generator and incorporates a self-attention mechanism into it, yielding the Yolov5-Soft Attention-BiLSTM model for extracting and encoding features from the dataset. Experiments verify that combining BiLSTM with the attention mechanism helps the model better identify the object currently being described, producing more coherent, image-relevant descriptions and higher scores (a minimal illustrative sketch of such a decoder is given after this summary).

(3) A COCO dataset preprocessing pipeline is established, including discarding low-frequency words and filtering stop words. Experiments verify that this preprocessing effectively improves the model's accuracy and computational speed, achieving better performance than before (a sketch of this preprocessing is also given below).

(4) Structural local adjustments and optimizations are made to the Yolov5 and BiLSTM models. Experiments verify that these improvements further increase the accuracy and computational speed of the image captioning algorithm.

Experimental results show that the proposed algorithm outperforms previous state-of-the-art methods in terms of model accuracy, computational speed, and evaluation scores.
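To make contribution (2) concrete, the following is a minimal Python (PyTorch) sketch of soft attention over detector-derived region features feeding a bidirectional LSTM caption generator. All names, dimensions, and the single learned attention query (SoftAttention, AttentionBiLSTMDecoder, feat_dim=256, etc.) are illustrative assumptions, not the thesis implementation.

# Minimal sketch: soft attention over detector region features + BiLSTM decoder.
# Module and variable names, and all dimensions, are illustrative assumptions.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive (soft) attention over a set of region feature vectors."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)      # attention weight per region
        context = (alpha * feats).sum(dim=1)      # weighted region context vector
        return context, alpha.squeeze(-1)

class AttentionBiLSTMDecoder(nn.Module):
    """BiLSTM caption generator conditioned on attended image features."""
    def __init__(self, vocab_size, feat_dim=256, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.query = nn.Parameter(torch.zeros(1, hidden_dim))  # learned attention query
        self.attention = SoftAttention(feat_dim, hidden_dim, attn_dim=256)
        self.bilstm = nn.LSTM(embed_dim + feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (batch, num_regions, feat_dim); captions: (batch, seq_len) token ids
        batch, seq_len = captions.shape
        emb = self.embed(captions)                               # (batch, seq_len, embed_dim)
        context, _ = self.attention(feats, self.query.expand(batch, -1))
        context = context.unsqueeze(1).expand(-1, seq_len, -1)   # repeat context per step
        hidden, _ = self.bilstm(torch.cat([emb, context], dim=-1))
        return self.out(hidden)                                   # per-token vocabulary logits

# Example usage with random tensors standing in for Yolov5 region features:
decoder = AttentionBiLSTMDecoder(vocab_size=10000)
feats = torch.randn(4, 36, 256)                 # e.g. 36 region features per image
tokens = torch.randint(0, 10000, (4, 20))       # teacher-forced caption tokens
logits = decoder(feats, tokens)                 # (4, 20, 10000)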
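The caption-vocabulary preprocessing of contribution (3) can likewise be sketched as follows; the frequency threshold, stop-word list, and special tokens are assumptions chosen only for illustration.

# Minimal sketch of caption vocabulary preprocessing: tokens below a frequency
# threshold map to <unk>, and stop words are removed. Threshold, stop-word list,
# and special-token names are illustrative assumptions, not the thesis settings.
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "with"}  # example list only
MIN_FREQ = 5                                                      # example threshold

def build_vocab(captions, min_freq=MIN_FREQ, stop_words=STOP_WORDS):
    """captions: iterable of tokenized captions (lists of lowercase words)."""
    counts = Counter(tok for cap in captions for tok in cap if tok not in stop_words)
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]
    vocab += sorted(tok for tok, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(vocab)}

def encode_caption(caption, word2idx, stop_words=STOP_WORDS):
    """Map a tokenized caption to indices, skipping stop words and rare tokens."""
    unk = word2idx["<unk>"]
    body = [word2idx.get(tok, unk) for tok in caption if tok not in stop_words]
    return [word2idx["<start>"]] + body + [word2idx["<end>"]]

# Example usage on toy captions (COCO captions would be tokenized the same way):
caps = [["a", "dog", "runs", "on", "the", "grass"],
        ["a", "dog", "plays", "with", "a", "ball"]]
word2idx = build_vocab(caps, min_freq=1)
print(encode_caption(caps[0], word2idx))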