
Image Captioning Based On Deep Recurrent Convolution Network And Spatio-temporal Information Fusion

Posted on: 2020-06-05    Degree: Master    Type: Thesis
Country: China    Candidate: J Liu    Full Text: PDF
GTID: 2428330590496485    Subject: Electronic and communication engineering
Abstract/Summary:
Image captioning aims to generate a natural language description for a given image. As an emerging research topic in artificial intelligence, image captioning has attracted increasing attention. A captioning model must not only identify the objects in an image, their attributes, and the relationships among them, but also generate grammatically and semantically correct sentences. Image captioning therefore involves two basic problems: visual understanding and language processing. Solving them requires both computer vision and natural language processing techniques, which greatly increases the challenge of the task. State-of-the-art image captioning models are based on deep learning: a Convolutional Neural Network (CNN) is used as an encoder to extract image features, and a Recurrent Neural Network (RNN) is used as a decoder to generate descriptions. However, existing methods cannot fully exploit the spatial information of the image, and they ignore the fusion between the spatial information of the image and the temporal information of the text sequence. To address these problems, this dissertation designs three image captioning methods based on the encoder-decoder framework and the attention mechanism. The main research contents are as follows:

1. An algorithm named Image Captioning with Deep Recurrent Convolution Network (DRCN) is designed. The algorithm first uses a CNN to extract image features, then uses a Convolutional LSTM (ConvLSTM) to learn and remember the three-dimensional image feature maps. Finally, the output of the ConvLSTM is fed into a Long Short-Term Memory (LSTM) network, which guides the language generation model to produce a word at each time step. Compared with traditional CNN-LSTM captioning algorithms, the sentences generated by DRCN capture more of the semantic information in the images.

2. To make full use of the spatial information of the image, an algorithm named Spatial Attention for Generating Image Descriptions (Sp-Attention) is designed. The algorithm uses a CNN as the encoder. Based on the word generated at the previous time step, the three-dimensional feature maps of the convolutional layer are weighted, which preserves the spatial information of the image to the greatest extent. The weighted map is then transformed into a context vector and fed into the language generation model, so that the model learns the image region corresponding to the word at each time step. Compared with previous captioning algorithms based on visual attention, the sentences generated by Sp-Attention contain more detailed image information and are more consistent with the image content.

3. To combine the spatial information of the image with the temporal information of the sentence, an algorithm named Image Captioning with Deep Recurrent Convolution Network and Spatial Attention (DRCN-SA) is designed. The algorithm first uses a ConvLSTM to learn and remember the extracted CNN features, then adds a spatial attention layer after the ConvLSTM output, and finally uses the context vector produced by the attention layer to control the LSTM that generates words. Combining the ConvLSTM layer with the spatial attention layer fuses the spatial information of the images with the temporal information of the sentences, and the deepened network lets the model learn more image and text information, making the generated descriptions closer to the reference labels in the datasets. In addition, a comparison algorithm named Image Captioning with Spatial Attention and Deep Recurrent Convolution Network (SA-DRCN) is designed to verify the rationality and effectiveness of DRCN-SA.
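The encoder-side recurrence of DRCN (point 1) can be illustrated with a minimal ConvLSTM cell. This is a sketch under our own assumptions, not the thesis implementation: the class name and shapes are illustrative, and for brevity the gate convolutions are reduced to 1x1 convolutions (a per-pixel linear map over channels), whereas a real ConvLSTM would use k x k kernels so that neighbouring pixels interact.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConvLSTMCell:
    """Minimal ConvLSTM cell operating on (channels, H, W) feature maps.

    Simplification: the convolutions are 1x1, i.e. a linear map over the
    channel axis applied independently at every pixel. A full ConvLSTM
    replaces this with k x k convolutions over the spatial neighbourhood.
    """

    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        # Weights for input, forget, output gates and candidate state,
        # stacked along the output axis: (4*hid_ch, in_ch + hid_ch).
        self.W = rng.standard_normal((4 * hid_ch, in_ch + hid_ch)) * 0.1
        self.b = np.zeros(4 * hid_ch)

    def step(self, x, h, c):
        # x: (in_ch, H, W); h, c: (hid_ch, H, W)
        z = np.concatenate([x, h], axis=0)          # (in_ch + hid_ch, H, W)
        # 1x1 "convolution": contract the channel axis at every pixel.
        gates = np.einsum('oc,chw->ohw', self.W, z) + self.b[:, None, None]
        i, f, o, g = np.split(gates, 4, axis=0)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c_new = f * c + i * g                        # memory keeps its H x W layout
        h_new = o * np.tanh(c_new)
        return h_new, c_new
```

In a DRCN-style pipeline the CNN feature map would be fed through such a cell for several steps, and the resulting hidden map (still three-dimensional, so spatial structure is preserved) would condition the LSTM word decoder, which is not shown here.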
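The spatial-attention step shared by points 2 and 3 — weighting the H x W locations of a convolutional feature map by the decoder's previous state and collapsing them into a context vector — can be sketched as follows. The additive (Bahdanau-style) scoring function and the parameter names Wv, Wh, w are our assumptions for illustration; the thesis does not specify the exact scoring form.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_attention(feats, h_prev, Wv, Wh, w):
    """Attention-weighted pooling over the spatial grid of a feature map.

    feats:  (C, H, W) convolutional feature map from the encoder.
    h_prev: (D,) decoder hidden state from the previous time step.
    Wv: (A, C), Wh: (A, D), w: (A,) -- illustrative attention parameters.
    Returns the context vector (C,) and the (H, W) attention weight map.
    """
    C, H, W = feats.shape
    V = feats.reshape(C, H * W)                 # one column per spatial location
    # Additive scoring: each location is compared against the decoder state.
    scores = w @ np.tanh(Wv @ V + (Wh @ h_prev)[:, None])   # (H*W,)
    alpha = softmax(scores)                      # weights over locations, sums to 1
    context = V @ alpha                          # (C,) weighted average of locations
    return context, alpha.reshape(H, W)
```

At each decoding step the context vector is recomputed from the new hidden state, so the language model attends to a different image region for each generated word, which is the mechanism the abstract credits for the more detailed, image-consistent sentences.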
Keywords/Search Tags:image captioning, deep learning, deep neural networks, encoder-decoder framework, attention mechanism