
Research On Image Caption Method Based On Multi-Feature Fusion And Visual Semantic Adaptation

Posted on: 2024-06-02    Degree: Master    Type: Thesis
Country: China    Candidate: J D Li    Full Text: PDF
GTID: 2558307127460934    Subject: Computer technology
Abstract/Summary:
The image caption task takes an image as input and, through a mathematical model and computation, makes the computer output a natural-language description of that image, giving the computer the ability to "look at a picture and speak". As a cross-modal task spanning computer vision and natural language processing, image captioning has attracted extensive attention from scholars at home and abroad. This thesis extends the classical Transformer architecture: on the encoding side, the improvements focus on exploring efficient and accurate image feature representations; on the decoding side, an attention mechanism ensures that the model generates accurate and fluent sentences. The main research of this thesis is as follows:

(1) A multi-modal fusion method based on hierarchical enhancement is proposed to optimize the visual representation of images. It addresses two problems caused by methods that rely on a single object feature: non-target regions of the image are neglected, and fine-grained information about objects is lost. A novel attention operation in the encoder explores the complementarity of text features with region and grid features, which are then adaptively fused by a gating mechanism to obtain a comprehensive image representation (a minimal sketch of such a gated region-grid fusion follows this abstract). In addition, the multi-layer encoder captures different representation spaces of the image, modelling increasingly complex image vectors as the number of stacked layers grows. This thesis also improves the original encoder-decoder cross-attention in the decoding layer by concatenating a layer-integrated vector with the visual vector that provides keys and values, yielding enhanced visual features (see the second sketch below). Online and offline experiments achieved similarly excellent scores, which further demonstrates the strong robustness of the model.

(2) An image description method based on visual interaction and adaptive selection is proposed to solve the local-optimum problem that arises when visual and non-visual words are treated equally during decoding. Current mainstream Transformer-based approaches first apply object detection to the image, extract its features, and deliver them to the encoder; descriptions are then generated step by step under the guidance of the image. A description sentence contains both visual and non-visual words, and predicting them indiscriminately at decoding time is unreasonable, so an adaptive attention module is added before the decoding prediction to perform multi-modal inference for word generation (a sketch of this idea follows this abstract). In addition, the interaction of region and grid features is further explored, so that each region adaptively captures features from the grid; the visual information from the two sources thus becomes complementary, further increasing the granularity of the description from a multi-feature fusion perspective. Experiments show that the method achieves a CIDEr score of 131.6, a 0.3% improvement over the advanced M2 model.

Based on the experimental results on the MS COCO dataset, the model proposed in this thesis generates semantically rich and fluent descriptions and achieves highly competitive performance against state-of-the-art models on the BLEU-{1,2,3,4}, METEOR, ROUGE-L, and CIDEr metrics, in both online and offline tests.
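The abstract does not give the exact equations for the gated fusion of region and grid features, so the following is only a minimal PyTorch sketch under assumptions: each region attends over the grid features via standard multi-head cross-attention, and a learned sigmoid gate blends the attended grid context with the original region feature. The class name RegionGridFusion and all shapes are hypothetical.

```python
import torch
import torch.nn as nn

class RegionGridFusion(nn.Module):
    """Hypothetical sketch of gated region-grid fusion (not the thesis's exact method).

    Each region feature queries the grid features via cross-attention;
    a sigmoid gate then adaptively mixes the attended grid context back
    into the region representation.
    """

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # d_model must be divisible by n_heads
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, regions: torch.Tensor, grids: torch.Tensor) -> torch.Tensor:
        # regions: (B, N_r, d)  detected object-region features
        # grids:   (B, N_g, d)  grid (CNN feature-map) features
        ctx, _ = self.cross_attn(query=regions, key=grids, value=grids)
        # Gate in [0, 1] decides, per dimension, how much grid context to keep.
        g = torch.sigmoid(self.gate(torch.cat([regions, ctx], dim=-1)))
        return g * regions + (1.0 - g) * ctx
```

For example, `RegionGridFusion(512)(torch.randn(2, 50, 512), torch.randn(2, 49, 512))` returns a `(2, 50, 512)` fused representation, one enhanced vector per region.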
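The "layer-integrated vector" spliced into the cross-attention keys and values is likewise not specified in detail; one plausible reading, sketched below, is that the per-layer encoder outputs are averaged and concatenated with the last layer's output along the token dimension. The function name and the choice of mean-pooling are assumptions.

```python
import torch

def layer_integrated_kv(layer_outputs: list[torch.Tensor]) -> torch.Tensor:
    """Hypothetical sketch: build enhanced keys/values for cross-attention
    by appending an average of all encoder layers to the final layer output.

    layer_outputs: list of (B, N, d) tensors, one per encoder layer.
    Returns a (B, 2N, d) tensor the decoder can attend over.
    """
    integrated = torch.stack(layer_outputs, dim=0).mean(dim=0)  # (B, N, d)
    return torch.cat([layer_outputs[-1], integrated], dim=1)    # (B, 2N, d)
```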
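For the adaptive attention module in contribution (2), the abstract only states that multi-modal inference is performed before word prediction. The sketch below follows the well-known "visual sentinel" formulation of adaptive attention (Lu et al., 2017) as a stand-in: a learned language-only vector competes with the visual features in the softmax, so non-visual words can draw on the decoder state rather than the image. All names and shapes are illustrative, not the thesis's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Hypothetical adaptive-attention sketch in the visual-sentinel style."""

    def __init__(self, d_model: int):
        super().__init__()
        self.sentinel = nn.Linear(d_model, d_model)  # language-only fallback vector
        self.proj_v = nn.Linear(d_model, d_model)
        self.proj_h = nn.Linear(d_model, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # h:      (B, d)    current decoder hidden state
        # visual: (B, N, d) encoded image features
        s = torch.tanh(self.sentinel(h)).unsqueeze(1)   # (B, 1, d) sentinel
        cand = torch.cat([visual, s], dim=1)            # (B, N+1, d) candidates
        logits = self.score(
            torch.tanh(self.proj_v(cand) + self.proj_h(h).unsqueeze(1))
        )                                               # (B, N+1, 1)
        alpha = F.softmax(logits, dim=1)                # attention incl. sentinel
        return (alpha * cand).sum(dim=1)                # (B, d) mixed context
```

When the attention weight on the sentinel is high, the word being generated is effectively treated as non-visual and the prediction leans on the language context.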
Keywords/Search Tags:Deep learning, Image caption, Transformer, Multi-modal feature fusion