Image Captioning By Multi-feature Fusion

Posted on:2023-06-06

Degree:Master

Type:Thesis

Country:China

Candidate:Y Li

GTID:2568306830981279

Subject:Software engineering

Abstract/Summary:

Image captioning is a cross-domain problem spanning computer vision and natural language processing. It has been studied extensively in recent years, giving rise to a series of typical methods built on the encoder-decoder framework. Transformer-based methods have greatly improved caption quality by introducing a self-attention mechanism and have rapidly become the mainstream approach. This thesis proposes a Transformer-based image captioning method that uses multi-feature fusion. First, the attention mechanism in the standard Transformer model does not make full use of the spatial relationships between objects in an image, especially their relative directions. To address this, the thesis proposes a spatial relationship encoding strategy: an absolute position matrix and a relative position matrix are generated from the objects' position information, a relative direction matrix is added on this basis, and the spatial relationships between image objects are expressed by fusing these three types of spatial features. Second, region features have difficulty representing the global context of an image. To address this, the thesis designs a cross-attention mechanism that fuses grid features and region features, using the global representation ability of grid features to compensate for the lack of contextual information in region features and to capture a more fine-grained representation of the image. The thesis also designs a Fusion Gate Operation that controls the interaction between grid features and region features, guiding the model to generate high-quality captions. Experiments are conducted on MS-COCO, a typical benchmark dataset for image captioning, and the proposed method is compared with existing representative image captioning methods using mainstream evaluation metrics. The results show that the proposed method, which combines grid features and region features, outperforms the other models, reaching 133.4% on CIDEr. Complete ablation experiments also verify the effectiveness of the proposed improvements.
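The abstract does not give formulas for the three spatial matrices or for the Fusion Gate Operation, so the following is only an illustrative sketch of what such components might look like, not the thesis's implementation. Everything in it is an assumption made for illustration: boxes are taken as `(x1, y1, x2, y2)` pixel tuples with positive width and height, the relative position uses a log-ratio encoding that is common in relation-aware attention work, directions are quantized into 8 bins, and the gate is a single scalar sigmoid. The function names (`spatial_matrices`, `relative_direction_bin`, `fusion_gate`) are hypothetical.

```python
import math


def box_center(box):
    """Return the (cx, cy) center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def relative_position(box_a, box_b, eps=1e-3):
    """Log-ratio offsets and size ratios of box_b relative to box_a (assumed encoding)."""
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return (
        math.log(max(abs(bx - ax), eps) / wa),  # eps avoids log(0) for aligned centers
        math.log(max(abs(by - ay), eps) / ha),
        math.log(wb / wa),
        math.log(hb / ha),
    )


def relative_direction_bin(box_a, box_b, num_bins=8):
    """Quantize the direction from box_a's center to box_b's center.

    Bin 0 is centered on the +x axis (b is to the right of a). Image y grows
    downward, so with 8 bins: 2 = below, 4 = left, 6 = above. Coincident
    centers (including a box paired with itself) fall into bin 0.
    """
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    angle = math.atan2(by - ay, bx - ax)  # radians in [-pi, pi]
    width = 2 * math.pi / num_bins
    return int(math.floor((angle + width / 2) / width)) % num_bins


def spatial_matrices(boxes, image_w, image_h):
    """Build absolute (N x 4), relative (N x N x 4) and direction (N x N) features."""
    absolute = []
    for box in boxes:
        cx, cy = box_center(box)
        absolute.append((cx / image_w, cy / image_h,
                         (box[2] - box[0]) / image_w, (box[3] - box[1]) / image_h))
    relative = [[relative_position(a, b) for b in boxes] for a in boxes]
    direction = [[relative_direction_bin(a, b) for b in boxes] for a in boxes]
    return absolute, relative, direction


def fusion_gate(region_vec, grid_vec, weights, bias=0.0):
    """Scalar-gated mix: g = sigmoid(w . [region; grid] + b); out = g*region + (1-g)*grid."""
    z = sum(w * x for w, x in zip(weights, list(region_vec) + list(grid_vec))) + bias
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * r + (1.0 - g) * c for r, c in zip(region_vec, grid_vec)]


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30)]
    absolute, relative, direction = spatial_matrices(boxes, 100, 100)
    for row in direction:
        print(row)
```

In a real model these features would typically be embedded and added as a bias to the attention logits, and the gate weights would be learned; the sketch only shows how the raw geometric quantities could be derived from bounding boxes.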

Keywords/Search Tags:

Image Captioning, Transformer, Spatial Relationship Encoding, Grid Feature, Region Feature

Related items

1	Image Feature Understanding And Semantic Representation Based On Deep Learning
2	Research On Structured Feature Representation In Images
3	Learning Positional Relationship For Image Captioning
4	Study On Three-dimensional Spatial Relationship Based On Single Image
5	Research On Image Captioning Based On Image Feature Fusion
6	Visual Feature Representation Based On Detailed Spatial Relationship Information For Image Classification
7	A Study On The Spatial-spectral Feature Description Algorithm Of Hyperspectral Image Based On Tensor
8	Image Captioning Based On Self-Attention Network
9	Study On Image Captioning Based On Spatial Topological Relationship
10	MLFormer: Multi-Layer Perceptive Transformer For Image Captioning