MLFormer: Multi-Layer Perceptive Transformer For Image Captioning

Posted on:2024-07-11

Degree:Master

Type:Thesis

Country:China

Candidate:X X Wang

GTID:2568307157983249

Subject:Software engineering

Abstract/Summary:

Image captioning is an important task in artificial intelligence; its goal is to make computers capable of "seeing and talking". The technology is already widely used in tasks such as image-text retrieval and daily-life assistance for people with disabilities, and it has broad prospective applications in vision robots and other visual intelligence services. In recent years, Transformer networks based on the self-attention mechanism have achieved excellent performance in image captioning by virtue of their powerful ability to model long-range dependencies. However, the complex structural design and low computational efficiency of the Transformer model have seriously hindered its practical application to image description generation. Recently, Multi-Layer Perceptron (MLP) models have shown a good balance between performance and computational cost. To address these issues, this paper proposes the following three MLP-based image captioning methods, which aim at an effective balance between performance and computational cost.

1) To address the problem that relative position encoding introduces additional parameters and computational overhead when grid-like image features are used directly as input to the Transformer model, the LG-MLFormer (Local and Global Multi-Layer Perceptive Transformer) model is proposed, with the LG-MLP (Local-Global MLP) module designed as its core operator. The LG-MLP consists of two independent LM (Local-MLP) modules and one CDGM (Cross-Domain Global MLP) module. The LM module is specifically designed to map the dimensionality between linear layers to [...]. The CDGM module captures the potential association between grid-like features and region features of the image through a nonlinear gating mechanism. Experimental results show that the proposed LG-MLFormer model generates more accurate image descriptions than current mainstream image description methods.

2) To address the problem that the self-attention mechanism ignores potential correlations between different images and fails to capture the internal and external relationships between different features, an Internal-External Multi-Layer Perceptual Transformer (IE-MLFormer) is proposed. In the encoding stage, the proposed internal-external joint control mechanism and a regularized multi-layer perceptron module are combined into a new internal-external multi-layer perceptron (IE-MLP) encoder, which enriches the captured visual semantic information. Experimental results show that, using region features, the method achieves advanced performance compared with mainstream methods. The overall performance is improved by 8.3% compared to the baseline model.

3) To address the structural complexity and computational inefficiency of the Transformer model, an encoder-decoder model, CMANet (Cross-Modal Adaptive Network), based entirely on MLP modules is proposed. CMANet learns potential semantic associations between visual and textual features through a linear mapping based on dynamic weights, and uses a two-way decoding structure to exploit the image-to-text and text-to-image relationships. This architecture significantly improves cross-modal relationship modeling and language generation. Results on the MS-COCO dataset show that CMANet outperforms the standard Transformer network while having fewer parameters and faster inference.
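The abstract does not specify how the CDGM module's nonlinear gating mechanism is built. As a rough illustration only, the sketch below shows one generic way to gate grid-like features with a context vector pooled from region features. The function name `gated_fusion`, the mean pooling, the residual connection, and the weight matrices `w_value` and `w_gate` are all assumptions made for this example and are not taken from the thesis.

```python
import math
import random


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def mean_vector(rows):
    """Average a list of equal-length vectors element-wise."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def gated_fusion(grid, region, w_value, w_gate):
    """Hypothetical gated fusion of grid features with region features.

    grid:    list of grid feature vectors, each of length d
    region:  list of region feature vectors, each of length d
    w_value: d x d projection for the linear branch (assumed parameter)
    w_gate:  d x d projection for the gate branch (assumed parameter)
    """
    # Summarise all region features into a single context vector.
    context = mean_vector(region)
    fused = []
    for g in grid:
        # Linear branch applied to the grid feature alone.
        value = matvec(w_value, g)
        # Gate branch sees the grid feature plus the region context.
        gate_in = [a + b for a, b in zip(g, context)]
        gate = [sigmoid(z) for z in matvec(w_gate, gate_in)]
        # Element-wise gating with a residual connection to the input.
        fused.append([x + s * v for x, s, v in zip(g, gate, value)])
    return fused


def rand_matrix(rows, cols, scale=1.0):
    """Random matrix with Gaussian entries, used only as toy data."""
    return [[random.gauss(0.0, 1.0) * scale for _ in range(cols)]
            for _ in range(rows)]


random.seed(0)
d = 4
grid = rand_matrix(6, d)      # 6 toy grid features
region = rand_matrix(3, d)    # 3 toy region features
w_value = rand_matrix(d, d, 0.1)
w_gate = rand_matrix(d, d, 0.1)

fused = gated_fusion(grid, region, w_value, w_gate)
print(len(fused), len(fused[0]))  # 6 4
```

In this sketch the gate depends on both feature types, so the region features modulate how strongly each grid feature is updated. With `w_value` set to zero the function returns the grid features unchanged, which is the usual motivation for a residual form.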

Keywords/Search Tags:

image captioning, multi-layer perceptron, transformer, encoder-decoder framework

Related items

1	Research On Image Captioning Based On Self-Attention And Encoder-Decoder
2	Research On Image Captioning Algorithm Based On Deep Learning
3	Research On Image Semantic Caption Generation Based On Encoder-Decoder Framework
4	Research On Image Captioning Algorithm Based On Attention Mechanism
5	Research On Video Captioning Methods Based On Encoder-decoder Structure
6	Image Captioning Based On Deep Recurrent Convolution Network And Spatio-temporal Information Fusion
7	A New Image Captioning Algorithm
8	Research On Computer Vision Image Captioning Based On Deep Learning
9	Research On Image Captioning Methods Based On Deep Learning
10	Research On Image Captioning Algorithm Based On Encoding And Decoding