Research On Diverse Image Captioning Method Based On Variational Autoencoder

Posted on: 2024-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: J Xu
GTID: 2568307118978399
Subject: Software Engineering Technology
Abstract:
With the development of computer vision and natural language processing, image captioning has become a highly active research area. Traditional image captioning methods aim to generate a single sentence closely related to the semantic content of an image, but they emphasize accuracy at the expense of diversity, which leads to overly simple and repetitive captions that do not match the diversity of human language. To address this issue, researchers have increasingly focused on diverse image captioning, which preserves accuracy while producing multiple descriptions with varying vocabulary and syntax. However, existing methods produce diverse captions that lack discriminability and offer limited controllability and interpretability, which severely hinders their practical application. This thesis proposes methods for diversifying image captions based on the mainstream encoder-decoder architecture, with the following main research and application outcomes:

(1) To address the insufficient discriminability of traditional diverse captioning methods, a new double-stream conditional variational autoencoder method is proposed. This method combines a sequence variational autoencoder with dual contrastive learning to enhance the model’s ability to generate diverse and accurate descriptions. Specifically, in the encoding phase, the method constructs a double-stream conditional variational autoencoder to learn a pair of description latent spaces and introduces contrastive learning in the sequence latent space. In the decoding phase, it uses descriptions sampled from a pre-trained LSTM decoder as negative samples and contrasts them with positive samples obtained by greedy sampling. This method not only exploits the discriminability between paired and unpaired image-caption pairs, but also suppresses the generation of common words caused by the cross-entropy loss, significantly improving the discriminability and diversity of the generated descriptions while maintaining their quality.

(2) To address the limitation that existing Transformer-based image captioning models are restricted to a single mapping between an image and a description, a conditional variational Transformer-based image captioning model is proposed. This model combines the Transformer with variational autoencoding and uses a Swin Transformer as the image feature extractor for end-to-end training. On this basis, the image features and text captions are encoded into a global latent variable, and a KL loss measures the distance between the conditional posterior and prior distributions. With the strong encoding ability of the Transformer and the optimization of the conditional variational lower bound, the proposed model can generate more diverse combinations of words and phrases. Compared with traditional diverse captioning methods, the proposed framework achieves a better balance between the accuracy and diversity of image captions.

(3) To address the insufficient controllability of diverse image caption generation, a controllable image captioning model is proposed. The core idea is to decouple the grammatical structure of a caption from the generation of its word sequence, thereby improving the interpretability and controllability of the caption generation process. Specifically, a part-of-speech sequence generation model is built on a conditional variational autoencoder to map an image to diverse part-of-speech sequences. A part-of-speech sequence is then used as a control signal to guide the decoder in generating a caption, enabling control over the granularity of the generated captions. This method therefore not only generates accurate and diverse captions for a given image but also allows caption generation to be customized through part-of-speech sequences.

To validate the effectiveness of the proposed methods, extensive quantitative and qualitative experiments were conducted on the widely used MSCOCO dataset, and a fair and comprehensive comparison was made with classic methods for generating diverse image descriptions. The experimental results demonstrate the superiority of the proposed methods.
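
As an illustration of the KL loss between the conditional posterior and prior mentioned in (2), the following is a minimal sketch, not the thesis's implementation. It assumes both distributions are diagonal Gaussians parameterized by means and log-variances, which the abstract does not state. The function names `kl_diag_gaussians` and `negative_elbo` and the weight `beta` are hypothetical and chosen only for this example.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between two diagonal Gaussians.

    Assumption: q is the conditional posterior and p the conditional prior,
    each given as per-dimension means and log-variances.
    """
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Per-dimension term:
        # 0.5 * (log(var_p / var_q) + (var_q + (mq - mp)^2) / var_p - 1)
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl


def negative_elbo(recon_nll, kl, beta=1.0):
    """Training objective sketch: reconstruction NLL plus a weighted KL term.

    beta is a hypothetical weighting factor; beta = 1.0 gives the standard
    negative conditional variational lower bound.
    """
    return recon_nll + beta * kl
```

When the posterior equals the prior the KL term is zero, so the objective reduces to the reconstruction loss alone; the term grows as the posterior moves away from the prior.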
Keywords: Diverse Image Captioning, Variational Autoencoder, Contrastive Learning, Part of Speech Tagging, Transformer