| Image captioning is the task of generating a textual description of a given image from its extracted image features. It lies at the intersection of two important research areas of deep learning: computer vision and natural language processing. At present, mainstream image captioning algorithms mostly adopt the encoder-decoder framework, in which the encoder extracts image features and the decoder generates the corresponding text description. This thesis studies image captioning algorithms based on the encoder-decoder framework. These models still have problems, such as the decoder's insufficient use of feature information and the lack of fine granularity in the generated sentences. To address them, attention mechanisms are used to filter out irrelevant information so that both language and image features are fully exploited. The main work and contributions of this thesis are as follows. (1) An image captioning algorithm based on a mixed attention mechanism is proposed. Because the decoder does not fully utilize the feature information, the attention mechanism over image features is improved so that the decoder attends to global and local features at the same time. The Attention on Attention (AoA) mechanism is introduced to compute finer-grained attention over image features, and it is combined with the soft attention mechanism so that the model takes both global and local features into account. On the MSCOCO dataset, the proposed MANet model improves BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE and CIDEr by 0.3, 0.4, 0.3, 0.4, 0.3 and 0.6, respectively, over the baseline model, which demonstrates its effectiveness. (2) An image captioning algorithm based on an adaptive attention mechanism is proposed. To address the insufficiently fine-grained sentences generated by the decoder, the attention mechanism over text features is improved. Building on the mixed-attention image captioning model proposed in Chapter 3, an adaptive sentinel attention mechanism is introduced to determine the category of the word currently being generated, a new LSTM layer is added to generate initial sentences, and the language attention mechanism of the baseline algorithm is improved. On the MSCOCO dataset, the proposed ASNet model improves BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE, CIDEr and SPICE by 0.2, 0.3, 0.2, 0.1, 0.1, 0.3 and 0.1, respectively, over the baseline model, which demonstrates its effectiveness. (3) Design and implementation of an image captioning system. Based on the proposed models and the principles of software engineering, an image captioning system is designed and implemented. It provides user registration, user login, user information modification, image captioning and history records, and users can upload images to generate the corresponding captions. |
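The mixed attention idea in contribution (1) can be illustrated with a minimal, standard-library-only sketch. It assumes a common additive soft attention, $e_i = w_a^\top \tanh(W_v v_i + W_h h)$, and the usual AoA formulation, in which an attended vector $\hat v$ is combined with the query through an information vector $i = W_q^i q + W_v^i \hat v$ and a sigmoid gate $g = \sigma(W_q^g q + W_v^g \hat v)$, with biases omitted. The abstract does not say how the two branches are mixed, so the convex combination weighted by `beta` is a hypothetical choice. All function and parameter names (`soft_attention`, `aoa_attention`, `mixed_attention`, `beta`) are illustrative and are not taken from the thesis.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vadd(*vs: Vector) -> Vector:
    return [sum(xs) for xs in zip(*vs)]


def softmax(xs: Vector) -> Vector:
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: Vector, feats: List[Vector]) -> Vector:
    return [sum(w * f[j] for w, f in zip(weights, feats)) for j in range(len(feats[0]))]


def soft_attention(h: Vector, feats: List[Vector],
                   w_v: Matrix, w_h: Matrix, w_a: Vector) -> Vector:
    """Additive soft attention: e_i = w_a . tanh(W_v v_i + W_h h), context = sum_i a_i v_i."""
    proj_h = matvec(w_h, h)
    scores = [sum(a * math.tanh(x) for a, x in zip(w_a, vadd(matvec(w_v, f), proj_h)))
              for f in feats]
    return weighted_sum(softmax(scores), feats)


def aoa_attention(q: Vector, feats: List[Vector],
                  w_qi: Matrix, w_vi: Matrix, w_qg: Matrix, w_vg: Matrix) -> Vector:
    """Attention on Attention: gate the attended vector with a sigmoid gate (biases omitted)."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, f)) / scale for f in feats]
    v_hat = weighted_sum(softmax(scores), feats)        # scaled dot-product attention
    info = vadd(matvec(w_qi, q), matvec(w_vi, v_hat))   # information vector
    gate = [1.0 / (1.0 + math.exp(-x))
            for x in vadd(matvec(w_qg, q), matvec(w_vg, v_hat))]  # attention gate
    return [g * i for g, i in zip(gate, info)]


def mixed_attention(h: Vector, feats: List[Vector],
                    soft_params: Tuple, aoa_params: Tuple, beta: float = 0.5) -> Vector:
    """Hypothetical mix: convex combination of the AoA and soft attention outputs."""
    soft = soft_attention(h, feats, *soft_params)
    aoa = aoa_attention(h, feats, *aoa_params)
    return [beta * a + (1.0 - beta) * s for a, s in zip(aoa, soft)]
```

A toy call with identity weights and three 2-D region features:

```python
I = [[1.0, 0.0], [0.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(mixed_attention([0.2, -0.1], feats, (I, I, [1.0, 1.0]), (I, I, I, I)))
```

In a real model the weight matrices would be learned, and `beta` could itself be learned or replaced by concatenation followed by a projection.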
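Contribution (2) relies on adaptive attention with a sentinel. The sketch below follows the commonly used visual-sentinel formulation. This is an assumption, because the abstract does not give the thesis's exact variant. A sentinel $s_t = \sigma(W_x x_t + W_h h_{t-1}) \odot \tanh(m_t)$ is derived from the LSTM memory cell $m_t$ and scored alongside the image regions. The resulting weight $\beta_t$ on the sentinel indicates how much the next word relies on language context rather than on visual features, which is one way to separate word categories such as visual and non-visual words. All names (`visual_sentinel`, `adaptive_attention`, the `w_*` parameters) are hypothetical, and the region features are assumed to be already projected to the hidden size.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vadd(*vs: Vector) -> Vector:
    return [sum(xs) for xs in zip(*vs)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs: Vector) -> Vector:
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def visual_sentinel(x_t: Vector, h_prev: Vector, m_t: Vector,
                    w_x: Matrix, w_h: Matrix) -> Vector:
    """s_t = sigmoid(W_x x_t + W_h h_{t-1}) * tanh(m_t), elementwise."""
    gate = [sigmoid(z) for z in vadd(matvec(w_x, x_t), matvec(w_h, h_prev))]
    return [g * math.tanh(m) for g, m in zip(gate, m_t)]


def adaptive_attention(h_t: Vector, feats: List[Vector], s_t: Vector,
                       w_v: Matrix, w_s: Matrix, w_g: Matrix,
                       w_a: Vector) -> Tuple[Vector, float]:
    """Return (c_hat, beta), where beta is the attention weight given to the sentinel."""
    proj_h = matvec(w_g, h_t)

    def score(proj: Vector) -> float:
        # additive score: w_a . tanh(proj + W_g h_t)
        return sum(a * math.tanh(x) for a, x in zip(w_a, vadd(proj, proj_h)))

    z = [score(matvec(w_v, f)) for f in feats]  # one score per image region
    z_s = score(matvec(w_s, s_t))               # score for the sentinel
    alpha_hat = softmax(z + [z_s])              # softmax over regions plus sentinel
    beta = alpha_hat[-1]
    # Weighting regions and sentinel by alpha_hat equals beta * s_t + (1 - beta) * c_t,
    # where c_t is the context vector from softmax(z) over the regions alone.
    candidates = feats + [s_t]
    c_hat = [sum(w * c[j] for w, c in zip(alpha_hat, candidates))
             for j in range(len(s_t))]
    return c_hat, beta
```

For example, with identity weights:

```python
I = [[1.0, 0.0], [0.0, 1.0]]
s = visual_sentinel([0.5, -0.2], [0.1, 0.3], [0.8, -0.4], I, I)
c_hat, beta = adaptive_attention([0.1, 0.3], [[1.0, 0.0], [0.0, 1.0]], s, I, I, I, [1.0, 1.0])
print(beta)  # a value closer to 1 means the decoder leans on the sentinel rather than the image
```

Folding the sentinel into a single softmax avoids computing $c_t$ and $\beta_t$ separately, because the region weights of the extended softmax are already scaled by $1 - \beta_t$.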