Image Captioning By Multi-feature Fusion

Posted on: 2023-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2568306830981279
Subject: Software engineering
Abstract/Summary:
Image captioning is a cross-domain problem spanning computer vision and natural language processing. It has been studied extensively in recent years, producing a series of representative methods based on the encoder-decoder framework. Transformer-based methods have greatly improved caption quality by introducing a self-attention mechanism and have rapidly become the mainstream approach. This thesis proposes a Transformer-based image captioning method built on multi-feature fusion. First, the attention mechanism in the standard Transformer does not make full use of the spatial relationships between objects in an image, especially their relative directions. To address this, the thesis proposes a spatial relationship encoding strategy: an absolute position matrix and a relative position matrix are generated from the objects' position information, a relative direction matrix is added on top of them, and the spatial relationships between image objects are expressed by fusing these three types of spatial features. Second, region features struggle to represent the global context of an image. To address this, the thesis designs a cross-attention mechanism that fuses grid features with region features, using the global representation of image content provided by grid features to compensate for the missing contextual information in region features and to capture a more fine-grained representation of the image. In addition, the thesis designs a Fusion Gate Operation that controls the interaction between grid features and region features, guiding the model to generate high-quality captions. Experiments are conducted on MS-COCO, a standard dataset for image captioning, and the proposed method is compared with representative existing methods on mainstream evaluation metrics. The results show that the proposed method, which combines grid and region features, outperforms the compared models, reaching 133.4% on CIDEr. Complete ablation experiments further verify the effectiveness of the proposed improvements.
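
The abstract does not give the formulas behind the spatial relationship encoding or the Fusion Gate Operation, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes boxes in `(x1, y1, x2, y2)` pixel format, a log-scaled relative position feature (a form commonly used in geometry-aware attention), a relative direction quantized into 8 sectors, and a per-dimension sigmoid gate whose logits would come from a learned layer in a real model. All function names and parameters here are hypothetical.

```python
import math


def box_center(box):
    """Return the center (cx, cy) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def relative_position(box_a, box_b, eps=1e-3):
    """Log-scaled center offset and size ratio of box_b relative to box_a.

    Assumed form: offsets are normalized by box_a's width and height and
    clamped to eps so the logarithm stays finite.
    """
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    aw, ah = box_a[2] - box_a[0], box_a[3] - box_a[1]
    bw, bh = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return (
        math.log(max(abs(bx - ax) / aw, eps)),
        math.log(max(abs(by - ay) / ah, eps)),
        math.log(bw / aw),
        math.log(bh / ah),
    )


def relative_direction(box_a, box_b, num_bins=8):
    """Quantize the direction from box_a's center to box_b's center.

    Bin 0 is centered on the +x axis (to the right). Image y grows
    downward, so with 8 bins: 2 = below, 4 = left, 6 = above.
    Coincident centers fall into bin 0.
    """
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    angle = math.atan2(by - ay, bx - ax) % (2.0 * math.pi)
    sector = 2.0 * math.pi / num_bins
    return int((angle + sector / 2.0) // sector) % num_bins


def direction_matrix(boxes, num_bins=8):
    """Build the N x N matrix of pairwise direction bins."""
    return [[relative_direction(a, b, num_bins) for b in boxes] for a in boxes]


def gated_fusion(region_vec, grid_vec, gate_logits):
    """Blend two equal-length feature vectors with a per-dimension sigmoid gate.

    gate_logits stands in for the output of a learned layer (assumption).
    """
    fused = []
    for r, g, z in zip(region_vec, grid_vec, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-z))
        fused.append(gate * r + (1.0 - gate) * g)
    return fused


if __name__ == "__main__":
    boxes = [(0, 0, 2, 2), (4, 0, 6, 2), (0, 4, 2, 6)]
    print(direction_matrix(boxes))
    print(relative_position(boxes[0], boxes[1]))
    print(gated_fusion([1.0, 0.0], [0.0, 2.0], [0.0, 0.0]))
```

In a full model, the direction bins would typically index a learned embedding that is added to the attention logits together with the absolute and relative position terms; that wiring is omitted here because the abstract does not describe it.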
Keywords/Search Tags: Image Captioning, Transformer, Spatial Relationship Encoding, Grid Feature, Region Feature