
Text-to-image Generation Based On Feature Alignment And Fusion

Posted on: 2023-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Yang
Full Text: PDF
GTID: 2568306902958839
Subject: Cyberspace security
Abstract/Summary:
In recent years, text-to-image generation has become one of the most important research directions in computer vision. It aims to understand the semantic information of a text description and generate a relevant image. Generating specified images helps to expand datasets for DeepFake detection and, together with the text, supports analysis of online public opinion. Traditional text-to-image methods try to map text features to image features directly; because of the large gap between the two modalities, they perform poorly. Recently, generative adversarial networks (GANs) have significantly improved the quality of generated images, but many problems remain. Methods with multiple generator-discriminator pairs cause model redundancy and make the generated images appear to be composed of separate target objects, lacking visual realism, while single-stage methods do not make full use of the text features. Both kinds of methods ignore the information carried by unmatched real images and do not perform local pattern matching between image and text, so semantic consistency between image and text cannot be maintained. To address these problems, this thesis proposes two improved single-stage text-to-image generation models, as follows:

(1) We propose a text-to-image generation model based on feature fusion (MF-GAN). This method adopts a single generator-discriminator pair as its backbone and exploits both coarse-grained and fine-grained textual information through a conditional residual module and a dual attention module. Specifically, sentence and word features are repeatedly fed into these two modules for deep fusion of text and image features. In addition, we introduce a triplet loss that narrows the visual gap between generated images and their matched real images while widening the gap between generated images and unmatched real images, thereby exploiting the unmatched real images. Experimental results demonstrate that MF-GAN outperforms most state-of-the-art methods.

(2) We propose a text-to-image generation model based on feature alignment (MFA-GAN). This method extends MF-GAN, achieving local semantic alignment between text and image through a cross-modal attention mechanism. The mechanism aligns features in two directions, text-to-image and image-to-text. After the local matching similarity is computed in each direction separately, a triplet loss is applied to obtain the final semantic alignment loss, further improving the semantic consistency between text and image. Experiments confirm that the generation performance of MFA-GAN is better than that of MF-GAN.
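The triplet objective used in MF-GAN can be sketched as follows. This is a minimal PyTorch illustration, not the thesis's exact formulation: the choice of Euclidean distance, the margin value, and the feature shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull the generated image's features (anchor) toward the matched
    real image's features (positive), and push them away from an
    unmatched real image's features (negative), up to a margin."""
    d_pos = F.pairwise_distance(anchor, positive)  # distance to matched real image
    d_neg = F.pairwise_distance(anchor, negative)  # distance to unmatched real image
    # Loss is zero once the matched image is closer than the unmatched
    # one by at least the margin.
    return F.relu(d_pos - d_neg + margin).mean()
```

In training, the anchor would be features of the generated image, so minimizing this loss simultaneously narrows the gap to matched real images and widens it to unmatched ones.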
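The bidirectional local matching in MFA-GAN's cross-modal attention can be sketched as below. This is an illustrative reconstruction under assumptions: word features of shape (T, D), region features of shape (R, D), cosine similarity as the local matching measure, and a softmax temperature `gamma` that is not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def local_match_scores(words, regions, gamma=4.0):
    """Bidirectional local matching between word features (T, D) and
    image-region features (R, D). Returns a matching score for each
    alignment direction (text-to-image and image-to-text)."""
    w = F.normalize(words, dim=-1)
    r = F.normalize(regions, dim=-1)
    sim = w @ r.t()  # (T, R) word-region cosine similarities

    # Text-to-image: each word attends over image regions.
    attn_t2i = F.softmax(gamma * sim, dim=1)          # (T, R)
    ctx_t2i = attn_t2i @ r                            # region context per word
    score_t2i = F.cosine_similarity(w, ctx_t2i, dim=-1).mean()

    # Image-to-text: each region attends over words.
    attn_i2t = F.softmax(gamma * sim, dim=0).t()      # (R, T)
    ctx_i2t = attn_i2t @ w                            # word context per region
    score_i2t = F.cosine_similarity(r, ctx_i2t, dim=-1).mean()
    return score_t2i, score_i2t
```

These two directional scores would then feed a triplet loss over matched and unmatched text-image pairs to form the final semantic alignment loss.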
Keywords/Search Tags:Text-to-image generation, Generative Adversarial Network(GAN), Triplet loss, Semantic alignment, Cross-modal attention