Multimodal generation is the task of representing, aligning, reasoning over, integrating, and translating across multiple modalities. It is an active research direction of great scientific and practical value in artificial intelligence, and in recent years it has drawn attention from several communities, including computer vision and natural language processing. As a typical multimodal generation task, visual storytelling is challenging yet under-explored: the model must generate a story that is not only well-written and logically coherent but also grounded in the given images.

In previous work, a common approach is to use a pre-trained convolutional neural network as an image feature extractor and then feed the image features into a recurrent neural network to generate the story. This paradigm has two shortcomings. First, extracting image features with a convolutional neural network may limit the discovery of cross-domain image semantics: because the network is not fine-tuned on the target domain, the extracted features may lack high-level semantic information. Second, recurrent neural networks select information through gating mechanisms, which are poorly suited to fusing image and text information.

To address these shortcomings, this thesis proposes a pure transformer-based paradigm for the task. To exploit the deep semantic information of images, a vision transformer replaces the convolutional neural network as the image feature extractor. This module participates directly in training of the whole model and is fine-tuned by continually updating its own weights, which helps obtain more effective image feature representations. In addition, a Sequence-to-Sequence-based Vision Transformer (SSVT) model and a conditional Variational Autoencoder-based Vision Transformer (VAVT) model are proposed. Notably, VAVT takes the image representation as the prior network's input and the joint image-text representation as the posterior network's input, and explicitly fuses the two modalities by pulling the two networks' output distributions together with a KL-divergence term. The effectiveness of the proposed models is verified through comparative experiments, ablation studies, case studies, and visualization analysis. This research may serve as a reference for applying transformers to other multimodal generation tasks.
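The KL-based modality fusion described for VAVT can be sketched as follows. This is a minimal NumPy illustration, not the thesis's actual architecture: the linear maps standing in for the prior and posterior networks, the dimensions, and all variable names are assumptions. It shows only the core mechanism, i.e. computing a closed-form KL divergence between two diagonal Gaussians, one parameterized from the image representation alone (prior) and one from the joint image-text representation (posterior).

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

rng = np.random.default_rng(0)
d_img, d_txt, d_z = 8, 6, 4  # illustrative feature / latent sizes

# Hypothetical linear "networks" mapping inputs to Gaussian parameters (mu, logvar).
W_prior = 0.1 * rng.normal(size=(d_img, 2 * d_z))         # prior: image only
W_post = 0.1 * rng.normal(size=(d_img + d_txt, 2 * d_z))  # posterior: image + text

img = rng.normal(size=d_img)        # image representation
txt = rng.normal(size=d_txt)        # text representation
joint = np.concatenate([img, txt])  # joint image-text representation

mu_p, logvar_p = np.split(img @ W_prior, 2)
mu_q, logvar_q = np.split(joint @ W_post, 2)

# During training, this KL term would be added to the generation loss,
# pulling the image-only prior toward the image-text posterior and thereby
# explicitly fusing the two modalities.
kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
assert kl >= 0.0  # KL divergence is always non-negative
```

In a full conditional-VAE setup, a latent code would additionally be sampled from the posterior during training (via the reparameterization trick) and from the prior at inference time, with the KL term weighted against the story-generation loss.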