As a research area at the intersection of computer vision and natural language processing, image caption generation has been an active topic in recent years, contributing, in multimodal social media, to the translation of unstructured image data into structured text data. Prior work has proposed a series of image captioning methods, such as template-based, retrieval-based, and encoder-decoder approaches. Among these, the encoder-decoder framework is the most widely used: the encoder extracts image features with a Convolutional Neural Network (CNN), and the decoder generates the image description with a Recurrent Neural Network (RNN). The Neural Image Caption (NIC) model has achieved good performance in image captioning; however, some challenges remain. To tackle the lack of image information in the generated descriptions and their deviation from the core content of the image, the proposed model explores visual attention to deepen the understanding of the image, adopts textual attention to enhance the completeness of the information, and puts forward a dual attention mechanism that combines visual and textual attention to guide caption generation. To address the problem of generated sentences deviating from the core content of the image, the model builds on NIC: the encoder uses the Inception_v4 network to extract image features, while the decoder introduces a visual attention mechanism into the Long Short-Term Memory (LSTM) network. To address the lack of image information in the generated descriptions, a textual attention mechanism is proposed to enhance their information completeness. This thesis extracts image labels with a Fully Convolutional Network (FCN) and a Non-negative Matrix Factorization (NMF) topic model, and adopts the dual attention mechanism to guide caption generation, with textual attention attached to the image labels and visual attention focused on image regions. The effects of different positions of visual and textual attention on the captioning results are also explored. Experiments were conducted on the AIC-ICC dataset. The captions generated by the NICNDA model, based on the dual attention mechanism, are better than those of the benchmark model and of models with a single attention mechanism, which shows that the proposed NICNDA model is feasible. Moreover, the results obtained with different combinations of the dual attention mechanism show that this thesis's research on combining the two attention mechanisms is meaningful and effective.
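The per-step combination of visual and textual attention described above can be sketched as follows. This is a minimal, framework-free illustration, not the thesis's exact formulation: the function name `dual_attention_context`, the bilinear score matrices `Wv` and `Wt`, and the simple concatenation of the two context vectors are all assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_attention_context(h, regions, labels, Wv, Wt):
    """One decoder step of a dual-attention sketch (illustrative only).

    h       : (d,)  current LSTM hidden state
    regions : (k, dv) CNN region features (visual side)
    labels  : (m, dt) embeddings of extracted image labels (textual side)
    Wv      : (dv, d) visual score projection (hypothetical parameter)
    Wt      : (dt, d) textual score projection (hypothetical parameter)
    """
    # Visual attention: score each image region against the hidden state,
    # then form a weighted sum of region features.
    alpha_v = softmax(regions @ Wv @ h)   # (k,) attention weights over regions
    c_v = alpha_v @ regions               # (dv,) visual context vector

    # Textual attention: score each label embedding the same way.
    alpha_t = softmax(labels @ Wt @ h)    # (m,) attention weights over labels
    c_t = alpha_t @ labels                # (dt,) textual context vector

    # The combined context would feed the next LSTM step alongside the
    # previous word embedding; here we simply concatenate the two parts.
    return np.concatenate([c_v, c_t])
```

At each time step the visual part pulls the decoder toward salient image regions, while the textual part keeps label information (and hence content the caption might otherwise omit) available, which is the intuition behind combining the two mechanisms.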