
Research On Image Caption Generation Based On Global And Multilevel Feature Extraction

Posted on: 2023-09-01    Degree: Master    Type: Thesis
Country: China    Candidate: X D Han    Full Text: PDF
GTID: 2558306845490844    Subject: Computer technology
Abstract/Summary:
Image caption generation is an interdisciplinary research topic spanning computer vision and natural language processing. It usually adopts an encoder-decoder framework: the encoder extracts image features from the input image, and the decoder uses those features to generate the corresponding text description, so image feature extraction is the basis for generating captions. Current image caption generation suffers from insufficient image feature extraction, which limits the model's ability to reason about image content. This thesis addresses these problems by applying Faster RCNN and Transformer models to global and multi-level image feature extraction. The specific work is as follows.

(1) A GE-FTRAN model for image caption generation based on global image feature extraction is designed. GE-FTRAN builds on the Basic-FTRAN model, whose encoder combines Faster RCNN with a Transformer encoder for image feature extraction and whose decoder is a Transformer decoder that generates the text description. The GE-FTRAN encoder produces a global image feature by average-pooling the region features extracted by Faster RCNN, then feeds each region feature together with the global feature into the Transformer encoder, outputting the image region features as well as a more comprehensive global feature. An adaptive extraction module is designed to extract the global feature from each encoder layer and perform a weighted fusion. In the GE-FTRAN decoder, a global-feature adaptive guidance module is designed on top of the Transformer decoder so that the global and region features jointly guide the model in generating text descriptions. With two-stage training (cross-entropy loss followed by reinforcement learning), GE-FTRAN improves the BLEU-1, METEOR, ROUGE, and SPICE evaluation scores on the Microsoft COCO Caption dataset by 0.8%, 1.1%, 1.6%, and 1.3%, respectively.

(2) An MLE-FTRAN model for image caption generation based on multi-level image feature extraction is designed. MLE-FTRAN also builds on the Basic-FTRAN model. Its encoder feeds each image region feature into the Transformer encoder and, by using multiple encoder layers, outputs region features that carry multi-level region-relationship information. In the decoder, a multi-level cross-attention mechanism is designed on top of the multi-head cross attention of the Transformer decoder, so that multi-level image region features jointly guide the model in generating text descriptions. Under the same two-stage training, MLE-FTRAN improves the BLEU-1, METEOR, ROUGE, and SPICE scores on the above dataset by 0.5%, 0.8%, 1.4%, and 1.2%, respectively, over Basic-FTRAN. Finally, the encoder and decoder modules of GE-FTRAN and MLE-FTRAN are linearly combined into a model denoted GMLE-FTRAN. Two-stage training shows that GMLE-FTRAN outperforms GE-FTRAN on the same dataset, improving the BLEU-1, METEOR, ROUGE, and SPICE scores by 0.1%, 0.3%, 0.1%, and 0.2%, respectively.
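The multi-level cross attention described above can be illustrated with a minimal NumPy sketch: decoder queries attend separately to the region features output by each Transformer encoder layer, and the per-layer results are fused with learned weights. All names, shapes, and the softmax-weighted fusion below are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # Scaled dot-product attention: queries come from the decoder,
    # keys/values from one encoder layer's region features.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

def multilevel_cross_attention(queries, encoder_layers, fusion_w):
    # Attend to the region features of *every* encoder layer, then
    # fuse the per-layer attention outputs with a softmax over
    # `fusion_w` (a hypothetical learned weight vector).
    per_layer = [cross_attention(queries, feats, feats)
                 for feats in encoder_layers]
    w = softmax(np.asarray(fusion_w, dtype=float))
    return sum(wi * out for wi, out in zip(w, per_layer))

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 64))           # 5 partial-caption tokens
encoder_layers = [rng.normal(size=(36, 64))  # 36 image regions per layer
                  for _ in range(3)]         # 3 encoder layers
out = multilevel_cross_attention(queries, encoder_layers, [0.2, 0.5, 0.3])
print(out.shape)  # (5, 64)
```

In a real model the queries, keys, and values would each pass through learned linear projections and multiple attention heads; the sketch keeps only the part that distinguishes multi-level cross attention from ordinary cross attention, namely attending to every encoder layer rather than only the last one.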
Keywords/Search Tags: Image Caption Generation, Transformer Model, Image Global Feature, Image Multilevel Feature