| Image captioning is an important task in artificial intelligence whose goal is to make computers capable of "seeing and talking". The technology is already widely used in image-text retrieval and in daily-life assistance for people with disabilities, and it has broad prospective applications in vision robots and other visual intelligence services. In recent years, Transformer networks based on the self-attention mechanism have achieved excellent performance in image captioning thanks to their strong ability to model long-range dependencies. However, the complex structural design and low computational efficiency of the Transformer have seriously hindered its practical application in image captioning. Recently, Multi-Layer Perceptron (MLP) models have struck a good balance between performance and computational cost. To address these issues, this paper proposes three MLP-based image captioning methods that achieve an effective balance between performance and computational cost. 1) To address the additional parameters and computational overhead introduced by relative position encoding when grid features of an image are fed directly into the Transformer, the LG-MLFormer (Local and Global Multi-Layer Perceptive Transformer) model is proposed, with the LG-MLP (Local-Global MLP) module designed as its core operator. The LG-MLP consists of two independent LM (Local-MLP) modules and one CDGM (Cross-Domain Global MLP) module. The LM module is specifically designed to map the dimensionality between linear layers to [...]. The CDGM module effectively captures the potential associations between the grid features and region features of the image through a nonlinear gating mechanism. Experimental results show that the proposed LG-MLFormer model can generate more accurate image descriptions 
than current mainstream image captioning methods. 2) To address the problem that the self-attention mechanism ignores potential correlations between different images and fails to capture the internal and external relationships between different features, an Internal-External Multi-Layer Perceptual Transformer (IE-MLFormer) is proposed. In the encoding stage, the proposed internal-external joint control mechanism and a regularized multi-layer perceptron module are used to design a new Internal-External Multi-Layer Perceptron (IE-MLP) encoder, which enriches the captured visual semantic information. Experimental results show that, using region features, the method achieves advanced performance compared with mainstream methods, and its overall performance improves by 8.3% over the baseline model. 3) To address the structural complexity and computational inefficiency of the Transformer, CMANet (Cross-Modal Adaptive Network), an encoder-decoder model based entirely on MLP modules, is proposed. CMANet learns potential semantic associations between visual and textual features through a linear mapping based on dynamic weights and uses a two-way decoding structure to exploit both image-to-text and text-to-image relationships. This architecture significantly improves cross-modal relationship modeling and language generation. Results on the MS-COCO dataset show that CMANet outperforms the standard Transformer network while using fewer parameters and running inference faster. |
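The abstract states that the CDGM module relates grid features and region features through a nonlinear gating mechanism but gives no formula. The sketch below shows only the general idea of such a gate and is not the authors' CDGM design. It assumes, purely for illustration, that the region features are mean-pooled, passed through one linear layer and a sigmoid, and then used to rescale each channel of every grid feature. The names `gated_fusion`, `w_gate`, and `b_gate` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear(x, w, b):
    """Compute y = W x + b for one feature vector x (W is a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def mean_pool(features):
    """Average a list of equal-length feature vectors into a single vector."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def gated_fusion(grid, region, w_gate, b_gate):
    """Rescale each grid feature channel-wise with a gate derived from region features.

    Assumption (not from the paper): gate = sigmoid(W * mean(region) + b),
    so every gate value lies in (0, 1) and is shared by all grid positions.
    """
    gate = [sigmoid(g) for g in linear(mean_pool(region), w_gate, b_gate)]
    return [[x * g for x, g in zip(vec, gate)] for vec in grid]


if __name__ == "__main__":
    random.seed(0)
    dim = 4
    # Toy inputs: 6 grid cells and 3 detected regions, each with `dim` channels.
    grid = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
    region = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
    w_gate = [[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
    b_gate = [0.0] * dim
    fused = gated_fusion(grid, region, w_gate, b_gate)
    print(len(fused), len(fused[0]))
```

Because the gate is a sigmoid output, it can only attenuate grid channels, never amplify them. A real module would learn `w_gate` and `b_gate` by backpropagation and would likely use per-position rather than pooled gates.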