Image captioning refers to the process of automatically converting an image into a textual description that covers scenes, objects, actions, and other key information. Inspired by machine translation, the current mainstream approach to this task adopts a deep-learning encoder-decoder structure: the encoder stage uses a convolutional neural network (CNN) to extract image features, the decoder stage uses a long short-term memory (LSTM) network to extract semantic features, and visual attention fuses the visual and semantic information to output a sentence. However, in existing encoder-decoder structures the decoder only generates words one by one from front to back and cannot analyze the complete context; predicting a word in a sentence should attend not only to the preceding information but also to the following information.

To address this problem, this paper designs two image caption generation structures: d-LSTM (based on a parallel LSTM transmission structure) and Bi-LSTM-s (a bidirectional LSTM structure based on a subsidiary attention mechanism). Both structures use a CNN as the encoder, extracting visual information with the ResNet-101 network. d-LSTM adopts a two-layer parallel LSTM structure as the decoder and transfers the updated semantic information of the first LSTM layer to the second LSTM layer; after forgetting and updating, the two layers' semantic information is finally fused for output. The semantic information at the current time step thus comes not only from the serial LSTM structure but also from the parallel time-series semantic information, achieving context-based image caption generation.

The Bi-LSTM-s structure uses a bidirectional LSTM as the decoder: the visual information is fed separately into a forward decoder, F-LSTM, and a backward decoder, B-LSTM, to extract semantic information, and the resulting semantics are fused into a complementary output. However, the sequential mismatch between forward and backward semantic information degrades the output. To address this problem, this paper proposes a subsidiary attention mechanism (s-att) acting between F-LSTM and B-LSTM: s-att extracts the similarity between the B-LSTM and F-LSTM semantic information, identifies the semantic interaction regions according to that similarity, aligns the hidden states, and outputs complementary semantics. However, the subsidiary attention mechanism alone uses only semantic information as weights and loses the salient-region information of the image. To address this final problem, this paper integrates the visual attention mechanism with the subsidiary attention mechanism and outputs semantics progressively.

Multiple groups of comparative experiments are set up: verifying training performance under the cross-entropy loss versus reinforcement learning, comparing the d-LSTM and Bi-LSTM-s models, comparing against multiple state-of-the-art models, comparing results under different strategies for fusing the double attention mechanisms, and selecting hyperparameters for the subsidiary attention mechanism. The experimental results verify the superiority of the two models on the MSCOCO dataset: they obtain scores of 35.6 and 37.5 respectively on the BLEU-4 evaluation metric, effectively extracting contextual information and generating fine-grained captions.
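To make the d-LSTM idea concrete, the following is a minimal PyTorch sketch of a two-layer parallel decoder in which the first layer's updated hidden state is passed into the second layer and the two hidden states are fused before word prediction. The class name DLSTMDecoder, the dimensions, the concatenation-based input to the second layer, and the tanh fusion are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DLSTMDecoder(nn.Module):
    """Two parallel LSTM layers: layer 1's updated hidden state feeds layer 2,
    and both hidden states are fused before predicting the next word."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTMCell(embed_dim, hidden_dim)
        # Layer 2 sees the word embedding plus layer 1's current hidden state.
        self.lstm2 = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, state1, state2):
        x = self.embed(word_ids)                    # (batch, embed_dim)
        h1, c1 = self.lstm1(x, state1)              # layer-1 semantics
        h2, c2 = self.lstm2(torch.cat([x, h1], dim=1), state2)
        fused = torch.tanh(self.fuse(torch.cat([h1, h2], dim=1)))
        return self.out(fused), (h1, c1), (h2, c2)  # logits over the vocab
```

At each decoding step the previous word and both layers' states are fed back in, so the layer-2 state carries context from both its own serial history and layer 1's parallel semantic stream.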
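The subsidiary attention mechanism can likewise be sketched as a cross-attention between the F-LSTM and B-LSTM hidden-state sequences: a similarity matrix aligns each forward step with the backward states, and the aligned semantics are fused back in as a complementary output. The projected dot-product similarity and the additive fusion below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubsidiaryAttention(nn.Module):
    """Aligns forward (F-LSTM) and backward (B-LSTM) hidden-state sequences
    via a similarity matrix, then fuses the aligned backward semantics into
    the forward stream as a complementary output."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h_fwd, h_bwd):
        # h_fwd, h_bwd: (batch, seq_len, hidden_dim)
        sim = torch.bmm(self.proj(h_fwd), h_bwd.transpose(1, 2))  # (B, T, T)
        attn = F.softmax(sim, dim=-1)      # each forward step's weights over
        aligned = torch.bmm(attn, h_bwd)   # the backward states it attends to
        return h_fwd + aligned             # complementary fused semantics
```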
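Finally, a hedged sketch of the visual attention component that is fused with s-att to recover salient-region information: the current semantic state weights the ResNet-101 region features. The additive (Bahdanau-style) scoring and the names VisualAttention, feat_proj, and hid_proj are assumptions, not the paper's stated formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttention(nn.Module):
    """Weights image region features by the current semantic state, so the
    decoder regains the salient-area information that s-att alone loses."""

    def __init__(self, feat_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (batch, regions, feat_dim) from ResNet-101;
        # h: (batch, hidden_dim) current semantic state.
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hid_proj(h).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)        # weights over image regions
        return (alpha * feats).sum(dim=1)  # attended visual context vector
```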