Multimodal generation is the task of representing, aligning, reasoning over, integrating, and translating across multiple modalities. It is an active research direction of great scientific and practical value in artificial intelligence, and in recent years it has drawn attention from several communities, including computer vision and natural language processing. As a typical multimodal generation task, visual storytelling is challenging yet under-explored: the model must generate a story that is not only well-written and logically coherent but also grounded in the given images.

In previous work, a common approach is to use a pre-trained convolutional neural network as an image feature extractor and then feed the image features into a recurrent neural network to generate the story. This paradigm has two shortcomings. First, extracting image features with a convolutional neural network may limit the discovery of cross-domain image semantics: because the network is not fine-tuned on the target domain, the extracted features may lack high-level semantic information. Second, recurrent neural networks select information through gating mechanisms, which are poorly suited to fusing image and text information.

To address these shortcomings, this thesis proposes a pure transformer-based paradigm for the task. To exploit the deep semantic information of images, a vision transformer replaces the convolutional neural network as the image feature extractor. This module participates directly in training of the whole model and is fine-tuned by continually updating its own weights, which helps obtain more effective image feature representations. In addition, a Sequence-to-Sequence-based Vision Transformer (SSVT) model and a conditional Variational Autoencoder-based Vision Transformer (VAVT) model are proposed. Notably, VAVT takes the image representation as the prior network's input and the joint image-text representation as the posterior network's input, and explicitly fuses the two modalities by pulling the two networks' output distributions together with a KL-divergence term. The effectiveness of the proposed models is verified through comparative experiments, ablation studies, case studies, and visualization analysis. This research may serve as a reference for applying transformers to other multimodal generation tasks.
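The KL-based modality fusion described for VAVT can be sketched as follows. This is a minimal NumPy illustration, not the thesis's actual architecture: the linear maps standing in for the prior and posterior networks, the dimensions, and all variable names are assumptions. It shows only the core mechanism, i.e. computing a closed-form KL divergence between two diagonal Gaussians, one parameterized from the image representation alone (prior) and one from the joint image-text representation (posterior).

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

rng = np.random.default_rng(0)
d_img, d_txt, d_z = 8, 6, 4  # illustrative feature / latent sizes

# Hypothetical linear "networks" mapping inputs to Gaussian parameters (mu, logvar).
W_prior = 0.1 * rng.normal(size=(d_img, 2 * d_z))         # prior: image only
W_post = 0.1 * rng.normal(size=(d_img + d_txt, 2 * d_z))  # posterior: image + text

img = rng.normal(size=d_img)        # image representation
txt = rng.normal(size=d_txt)        # text representation
joint = np.concatenate([img, txt])  # joint image-text representation

mu_p, logvar_p = np.split(img @ W_prior, 2)
mu_q, logvar_q = np.split(joint @ W_post, 2)

# During training, this KL term would be added to the generation loss,
# pulling the image-only prior toward the image-text posterior and thereby
# explicitly fusing the two modalities.
kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
assert kl >= 0.0  # KL divergence is always non-negative
```

In a full conditional-VAE setup, a latent code would additionally be sampled from the posterior during training (via the reparameterization trick) and from the prior at inference time, with the KL term weighted against the story-generation loss.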