
Image Generation From Scene Graphs Based On Transformer

Posted on: 2024-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhao
Full Text: PDF
GTID: 2568306917997069
Subject: Artificial intelligence
Abstract/Summary:
Image generation from scene graphs is a type of AIGC (Artificial Intelligence-Generated Content) task that has traditionally been approached by first predicting a layout from the scene graph with graph convolutional networks and then converting that layout into an image. The scene graph here refers to the semantic scene graph, which consists of multiple objects and the relationships between them. As a structured data type, it shares the characteristics of natural language while remaining well suited to depicting images; it is therefore a good intermediary between natural language and images and benefits cross-modal research on text and images. With a simple interface, users can describe the required image as a semantic scene graph and let the computer generate the image automatically, which is of great significance for computer-aided design. For example, it can be used in interior design, where a scene graph drawn from the client's decoration requirements is rendered directly as a view of the finished room; it can help children learn the correspondence between language and images; and it can make image-editing software such as Visio and Photoshop more intelligent, making image editing and image production faster and more convenient.

At present, research on this task relies mainly on deep learning methods. The overall pipeline can be divided into two steps. The first step extracts object features with a graph convolutional network and predicts the layout with a regression network. The second step generates the final image from the layout with a CNN. The main loss functions are the L1 loss, the GAN loss, and the conditional GAN loss. Using the layout as an intermediary causes problems: the scene graph is not entirely equivalent to the layout, so some context information is lost in converting the scene graph to the layout, and it is also difficult to generate high-quality images with a CNN trained under GAN losses. Aiming at the above problems
in existing methods, we improve the overall framework of the model and the image generation paradigm. Following the current development trend in image generation, we propose to use the Transformer as the basic module of our backbone network. The main research content of this thesis comprises the following two parts. (1) We conduct experiments to find an image discrete autoencoder suitable for this task, which encodes an image as discrete tokens, compressing the image information, and then uses a decoder to restore the original image. This study prepares for training the final Transformer model: using raw image pixels directly as image tokens would require too much memory, increase computational complexity, and could even prevent the training process from converging. (2) We propose a method for image generation from scene graphs based on the Transformer, which predicts image tokens end to end in an autoregressive way without additional layout annotations. Furthermore, we propose a method to convert scene graphs into tokens. Experimental results on the COCO-Stuff and Visual Genome datasets show that our method significantly outperforms the state-of-the-art methods in terms of image quality.
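The two parts described above can be sketched end to end: a vector-quantized autoencoder turns an image into a grid of discrete codebook indices, the scene graph is flattened into a sequence of condition tokens, and a decoder predicts image tokens one at a time. The following is a minimal NumPy sketch under toy dimensions; the vocabularies, codebook size, grid size, and the stand-in for the Transformer are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Part (1) sketch: discrete image autoencoder ---------------------------
# A VQ-style autoencoder maps each encoder feature vector to the nearest
# entry of a learned codebook, so an image becomes a grid of token indices.
CODEBOOK = rng.normal(size=(32, 8))       # 32 codes of dimension 8 (toy sizes)

def quantize(features):
    """Return the index of the nearest codebook entry for each feature row."""
    dists = ((features[:, None, :] - CODEBOOK[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# --- Part (2) sketch: scene graph -> tokens, then autoregressive decoding --
obj_vocab = {"sky": 0, "tree": 1, "grass": 2}      # illustrative vocabularies
pred_vocab = {"above": 10, "on": 11}
SEP, BOS = 20, 21                                  # special tokens
GRID = 4                                           # image decoded as a 4x4 token grid

def scene_graph_to_tokens(triples):
    """Flatten (subject, predicate, object) triples into one token sequence."""
    seq = []
    for s, p, o in triples:
        seq += [obj_vocab[s], pred_vocab[p], obj_vocab[o], SEP]
    return seq

def toy_next_token_logits(context):
    # Stand-in for a Transformer decoder; a real model conditions on `context`.
    return rng.normal(size=len(CODEBOOK))

def generate_image_tokens(cond_tokens):
    """Greedily predict GRID * GRID image tokens, one per autoregressive step."""
    seq = list(cond_tokens) + [BOS]
    for _ in range(GRID * GRID):
        seq.append(int(toy_next_token_logits(seq).argmax()))
    return seq[len(cond_tokens) + 1:]

cond = scene_graph_to_tokens([("sky", "above", "tree"), ("tree", "on", "grass")])
img_tokens = generate_image_tokens(cond)
# img_tokens would be reshaped to a GRID x GRID map and fed to the decoder.
```

In the actual model, the codebook and the decoder are learned jointly (as in VQ-VAE/VQGAN-style autoencoders), and the next-token distribution comes from a Transformer conditioned on the scene-graph tokens rather than from random logits.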
Keywords/Search Tags: Image generation, Semantic scene graph, Autoregressive, Transformer