
Image Captioning Based On Mutual-aid Bidirectional LSTM And Progressive Decoding Mechanism

Posted on: 2019-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Yan
Full Text: PDF
GTID: 2428330596461604
Subject: Computer Science and Technology
Abstract/Summary:
Image captioning is a research hotspot at the intersection of computer vision and natural language processing. It requires an algorithm to accurately recognize image contents and express them as a fluent sentence. This paper investigates the shortcomings of existing algorithms and proposes two improvements: Mutual-aid Bidirectional Long Short-Term Memory (MB-LSTM) and the Progressive Decoding Mechanism (PDM).

Inspired by the success of Convolutional Neural Networks (CNNs) in image recognition and of Long Short-Term Memory (LSTM) in machine translation, mainstream image captioning algorithms employ a CNN to encode an image into image features and an LSTM to decode those features into a sentence. However, generating words one by one from front to back ignores the influence of succeeding words on the sentence as a whole. The contextual relations among words suggest that existing algorithms should be refined to take full advantage of context. This paper proposes Mutual-aid Bidirectional LSTM (MB-LSTM), which consists of a forward LSTM, a forward aid network, a backward LSTM, and a backward aid network. In the training stage, the forward and backward LSTMs encode the preceding and succeeding words into their respective hidden states, which are then fed into the forward and backward aid networks to predict each other's hidden states. This mechanism of mutually predicting hidden states enables each LSTM to comprehend the complementary context. After training, each LSTM can generate captions in a self-contained manner, implicitly exploiting the full context encoded in its hidden states. Comparative experiments on the Microsoft COCO dataset, using seven commonly used evaluation metrics, verify the effectiveness of MB-LSTM.

Another problem with existing algorithms is that the generated word sequence is adopted as the final image caption as soon as word generation finishes. However, this one-pass drafting method is not
in line with human writing habits: after writing a sentence, humans often polish it for greater fluency. Based on this observation, we draw on the Deliberation Network from machine translation and introduce the Progressive Decoding Mechanism (PDM) for the image captioning task. PDM follows a re-editing paradigm and refines pre-generated captions in three steps. First, MB-LSTM pre-generates a sentence, and the words in the sentence are converted into embedding vectors, namely the text features. Second, the text features are combined with the CNN image features into multimodal features. Third, an LSTM with a multimodal attention mechanism generates the final sentence from the multimodal features. This paper uses the Microsoft COCO dataset to verify the effectiveness of PDM and demonstrates its role through multimodal attention visualization.
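The MB-LSTM training mechanism described above can be sketched in PyTorch as follows. This is a minimal illustration, not the thesis's implementation: the embedding and hidden dimensions are hypothetical, and modeling each aid network as a single linear layer is an assumption (the abstract does not specify their form). Each direction's aid network regresses toward the other direction's (detached) hidden states, yielding the mutual-aid auxiliary loss.

```python
import torch
import torch.nn as nn

class MBLSTM(nn.Module):
    """Sketch of Mutual-aid Bidirectional LSTM (dimensions are illustrative)."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two independent unidirectional LSTMs over the same caption.
        self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Aid networks (assumed linear here) predict the opposite
        # direction's hidden states.
        self.fwd_aid = nn.Linear(hidden_dim, hidden_dim)
        self.bwd_aid = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, tokens):
        e = self.embed(tokens)                      # (B, T, E)
        h_f, _ = self.fwd_lstm(e)                   # encodes preceding words
        h_b, _ = self.bwd_lstm(torch.flip(e, [1]))  # encodes succeeding words
        h_b = torch.flip(h_b, [1])                  # realign time steps
        # Mutual-aid loss: each side predicts the other's hidden states,
        # so each LSTM absorbs the complementary context during training.
        aid_loss = ((self.fwd_aid(h_f) - h_b.detach()) ** 2).mean() \
                 + ((self.bwd_aid(h_b) - h_f.detach()) ** 2).mean()
        return h_f, h_b, aid_loss
```

After training, either LSTM can decode on its own, which matches the abstract's claim that each direction becomes self-contained.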
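The three-step PDM pipeline can likewise be sketched. Again this is a hedged illustration, not the thesis's model: the feature dimensions, the single-layer attention scorer, and the zero-initialized decoder state are all assumptions. The key idea shown is that draft-word embeddings (text features) and CNN region features are concatenated into one multimodal sequence that the second-pass LSTM attends over at every step.

```python
import torch
import torch.nn as nn

class PDMRefiner(nn.Module):
    """Sketch of a second-pass decoder with multimodal attention."""

    def __init__(self, vocab_size, embed_dim=256, img_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, hidden_dim)   # image features
        self.txt_proj = nn.Linear(embed_dim, hidden_dim)  # draft text features
        self.attn = nn.Linear(hidden_dim * 2, 1)          # attention scorer
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, draft_tokens, img_feats, steps=None):
        # Step 1: embed the pre-generated draft words (text features).
        txt = self.txt_proj(self.embed(draft_tokens))     # (B, T, H)
        # Step 2: fuse with projected CNN region features -> multimodal.
        img = self.img_proj(img_feats)                    # (B, R, H)
        mm = torch.cat([img, txt], dim=1)                 # (B, R+T, H)
        B, N, H = mm.shape
        steps = steps or draft_tokens.size(1)
        h = mm.new_zeros(B, H)
        c = mm.new_zeros(B, H)
        logits = []
        # Step 3: decode the refined sentence, attending over multimodal
        # features conditioned on the current decoder state.
        for _ in range(steps):
            query = h.unsqueeze(1).expand(-1, N, -1)      # (B, N, H)
            scores = self.attn(torch.cat([mm, query], -1))  # (B, N, 1)
            ctx = (scores.softmax(1) * mm).sum(1)         # (B, H)
            h, c = self.lstm(ctx, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                 # (B, steps, V)
```

Visualizing the softmax weights over the image-region vs. draft-word positions is one way to produce the multimodal attention visualizations the abstract mentions.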
Keywords/Search Tags: Image Captioning, Mutual-aid Bidirectional LSTM, Progressive Decoding Mechanism