
Image To Language: Auto Image Captioning Using Bi-directional LSTM And Deep Attention Neural Networks

Posted on: 2023-05-01
Degree: Doctor
Type: Dissertation
Institution: University
Candidate: Rashid Khan
Full Text: PDF
GTID: 1528306905481374
Subject: Information and Communication Engineering
Abstract/Summary:
Image Captioning (IC) is the task of generating a description for an image. It is a challenging problem because it requires understanding the important objects, attributes, and relationships in an image and then generating syntactically and semantically meaningful natural language descriptions. Artificial intelligence is still attempting to reproduce this capability, which comes effortlessly to humans; the task is demanding because it draws on both Computer Vision (CV) and Natural Language Processing (NLP). Advances in this research area might significantly affect the development of Artificial Intelligence (AI) technologies able to understand and communicate their perception of the visual world in the most natural way possible.

A typical image captioning pipeline has two components: an image encoder and a language decoder. The encoder is usually built from Convolutional Neural Networks (CNNs), whereas the decoder is usually a Long Short-Term Memory (LSTM) network; a range of LSTMs and CNNs, including attention mechanisms, are applied to create relevant and accurate captions. Despite previous research on image captioning approaches, existing methods still trail in achieving strong performance across diverse datasets. This research study aims to develop advanced models for automatic image caption generation based on deep neural networks.

First, an optimized Bi-directional Long Short-Term Memory (BLSTM) model is established for precise image captioning and classification. The Inception v3 model is deployed to extract features, and training is guided by a metaheuristic that tunes an appropriate number of epochs. We propose a variant of Moth Flame Optimization (MFO), termed Proposed Moth Flame Optimization (PMFO), whose logarithmic spiral update is based on correlation. The performance of the proposed model is demonstrated on benchmark datasets such as Flickr8k, Flickr30k, VizWiz, and MS COCO using renowned metrics such as CIDEr, BLEU, SPICE, and ROUGE. The performance analysis shows that the BLSTM achieves better caption generation than the baseline approaches; a model sketch and the PMFO update are illustrated below.

Our second research work proposes an attention-based sequence-to-sequence framework for the caption generation task: combined with a conventional encoder-decoder model, it generates captions in an attention-driven manner. ResNet-152 serves as the CNN-based encoder, producing a comprehensive representation of the input image and embedding it into a fixed-length vector. To predict the next word, the decoder uses an LSTM with an attention mechanism that selectively concentrates on certain regions of the image. The number of epochs is set to 69, which is sufficient to train the model to generate informative descriptions; by that point the validation loss has reached its minimum and no longer decreases. Experiments on the MS COCO and Flickr8k benchmark datasets illustrate the model's efficacy compared with baseline techniques.
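As a concrete sketch of the first framework's captioning model, the following PyTorch code conditions a bidirectional LSTM on a pooled Inception v3 feature vector under teacher forcing. All class names, variable names, and dimensions are our own illustrative assumptions, not the thesis implementation.

    import torch
    import torch.nn as nn

    class BLSTMCaptioner(nn.Module):
        # Bidirectional LSTM over [image embedding; caption tokens] (teacher forcing).
        def __init__(self, feat_dim=2048, emb=256, hid=512, vocab=10000):
            super().__init__()
            self.proj = nn.Linear(feat_dim, emb)   # Inception v3 feature -> embedding space
            self.embed = nn.Embedding(vocab, emb)
            self.blstm = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * hid, vocab)    # both directions -> vocabulary logits

        def forward(self, img_feat, tokens):
            # img_feat: (B, 2048) pooled Inception v3 features; tokens: (B, T) word ids
            x = torch.cat([self.proj(img_feat).unsqueeze(1), self.embed(tokens)], dim=1)
            out, _ = self.blstm(x)                 # (B, T+1, 2*hid)
            return self.fc(out)                    # (B, T+1, vocab) per-position logits

    model = BLSTMCaptioner()
    logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 20)))  # (4, 21, 10000)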
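The PMFO update can be sketched from the standard MFO logarithmic spiral, S(M_i, F_j) = D_i * exp(b*t) * cos(2*pi*t) + F_j. The abstract states only that PMFO bases this spiral on correlation, so the correlation-weighted spiral constant below is a hypothetical reading, not the thesis formula.

    import numpy as np

    def spiral_update(moth, flame, b, rng):
        # Standard MFO logarithmic spiral: D * exp(b*t) * cos(2*pi*t) + flame,
        # where D is the elementwise moth-to-flame distance and t is in [-1, 1].
        t = rng.uniform(-1.0, 1.0, size=moth.shape)
        d = np.abs(flame - moth)
        return d * np.exp(b * t) * np.cos(2.0 * np.pi * t) + flame

    def pmfo_spiral_update(moth, flame, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        # HYPOTHETICAL: the abstract says PMFO's spiral update is "based on
        # correlation" without giving a formula; here we simply tighten the
        # spiral via the Pearson correlation between moth and flame positions.
        r = np.corrcoef(moth, flame)[0, 1]
        b = 1.0 + (0.0 if np.isnan(r) else abs(r))
        return spiral_update(moth, flame, b, rng)

    moth, flame = np.random.rand(10), np.random.rand(10)
    print(pmfo_spiral_update(moth, flame))  # new candidate moth position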
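For the second framework's encoder, here is a minimal sketch of extracting ResNet-152 features with torchvision (assuming a recent torchvision with the weights enum API): the grid of region features feeds the attention mechanism, and pooling yields the fixed-length image vector.

    import torch
    from torchvision import models

    # Pretrained ResNet-152 with its classifier removed: everything up to the
    # last residual stage maps a 224x224 image to a 7x7 grid of 2048-d features.
    resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    encoder = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

    with torch.no_grad():
        fmap = encoder(torch.randn(1, 3, 224, 224))  # (1, 2048, 7, 7)
    regions = fmap.flatten(2).transpose(1, 2)        # (1, 49, 2048): regions to attend over
    pooled = regions.mean(dim=1)                     # (1, 2048): fixed-length image vector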
Our third proposed work aims to develop a system that uses a pre-trained Convolutional Neural Network (CNN) to extract features from an image, integrates those features with an attention mechanism, and generates captions with a Recurrent Neural Network (RNN). To encode an image into a feature vector of graphical attributes, we employ multiple pre-trained convolutional neural networks. A Gated Recurrent Unit (GRU) language model is chosen as the decoder to construct the descriptive sentence. To increase performance, we merge the Bahdanau attention model with the GRU so that learning can focus on a specific portion of the image (a decoding-step sketch follows below). On the MS COCO dataset, the experimental results achieve competitive performance against state-of-the-art approaches.

Overall, we propose three novel frameworks for image captioning. This thesis can strengthen the connection between computer vision and natural language generation, as well as extend caption generation into specific fields.
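As an illustration of the third framework's decoder, here is a minimal PyTorch sketch of one decoding step that combines Bahdanau (additive) attention with a GRU cell; names and sizes are assumptions for illustration, not the thesis code.

    import torch
    import torch.nn as nn

    class BahdanauAttention(nn.Module):
        # Additive attention: score(s, h_j) = v^T tanh(W1 s + W2 h_j)
        def __init__(self, feat_dim, hid_dim, att_dim):
            super().__init__()
            self.w1 = nn.Linear(hid_dim, att_dim)
            self.w2 = nn.Linear(feat_dim, att_dim)
            self.v = nn.Linear(att_dim, 1)

        def forward(self, state, regions):
            # state: (B, hid_dim) decoder state; regions: (B, L, feat_dim)
            scores = self.v(torch.tanh(self.w1(state).unsqueeze(1) + self.w2(regions)))
            alpha = torch.softmax(scores, dim=1)      # (B, L, 1) weights over regions
            context = (alpha * regions).sum(dim=1)    # (B, feat_dim) context vector
            return context, alpha.squeeze(-1)

    class GRUDecoderStep(nn.Module):
        def __init__(self, vocab=10000, emb=256, feat_dim=2048, hid_dim=512, att_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.attend = BahdanauAttention(feat_dim, hid_dim, att_dim)
            self.gru = nn.GRUCell(emb + feat_dim, hid_dim)
            self.out = nn.Linear(hid_dim, vocab)

        def forward(self, prev_word, state, regions):
            context, alpha = self.attend(state, regions)
            state = self.gru(torch.cat([self.embed(prev_word), context], dim=1), state)
            return self.out(state), state, alpha      # next-word logits, new state, weights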
Keywords/Search Tags: Image Captioning, Computer Vision, Natural Language Processing, Bi-directional Long Short-Term Memory, Proposed Moth Flame Optimization, Gated Recurrent Unit, Attention Mechanism, Inception v3, ResNet-152