In the era of the mobile internet, images have become a principal medium for information dissemination in human society. However, image creation remains complex and inefficient. Text-to-image synthesis aims to synthesize target images that are semantically consistent with natural language descriptions, making image creation more efficient. Owing to its extensive practical value, text-to-image synthesis has become an important topic in multimodal machine learning. Generative Adversarial Networks (GANs) have achieved good results on text-to-image synthesis tasks, but several issues remain. First, traditional multi-stage stacked models suffer from parameter entanglement between stages, which makes training difficult and limits the final result to the quality of the image synthesized in the initial stage. Second, recent single-stage models avoid these problems, but their utilization of fine-grained text features is insufficient. Third, current models make insufficient use of the mutual information between multimodal features. Finally, because of the limited cross-modal feature extraction capability of their feature encoders, GAN-based models are not competitive with large-scale autoregressive and diffusion models. To address these challenges, this thesis studies text-to-image synthesis based on Generative Adversarial Networks. The main contributions are summarized as follows:

(1) To address the insufficient utilization of fine-grained textual features in current single-stage models, a text-to-image synthesis model based on a multi-scale feature fusion mechanism is proposed. The mechanism comprises affine fusion blocks, attribute word and word joint blocks, and attribute word enhancement blocks. During image synthesis, textual features at different granularities, including sentences, attribute words, and words, are deeply fused with image features through affine transformations and mixed attention mechanisms (an illustrative affine fusion formulation is sketched after this abstract). To address the insufficient utilization of inter-modal mutual information in current models, a contrastive loss function is designed based on contrastive learning principles to fully exploit the mutual information carried by positive and negative sample pairs (see the illustrative contrastive objective below). In addition, attention mechanisms are used to strengthen the connection between image and text features in the discriminator. Experimental results demonstrate that the proposed model outperforms state-of-the-art GAN-based text-to-image synthesis models.

(2) To address the insufficient cross-modal feature extraction capability of current feature encoders, a text-to-image synthesis model based on the pre-trained vision-language model CLIP is proposed, offering a new paradigm for combining Generative Adversarial Networks with pre-trained models in text-to-image synthesis research. The proposed model includes text feature adapters and visual prompt supervision blocks, which fine-tune CLIP with visual prompts so that its cross-modal feature extraction capability transfers to this task. Feature extraction blocks are also designed to enhance the discriminator's ability to discriminate features. In addition, a CLIP similarity measure is incorporated into the adversarial loss function to fully exploit the mutual information between multimodal features (an illustrative form of this term is also sketched below). Experimental results show that the proposed model outperforms state-of-the-art GAN-based text-to-image synthesis models and achieves competitive performance against large-scale autoregressive and diffusion models while being faster.
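As a minimal sketch of the affine fusion referred to in contribution (1), conditional affine fusion blocks of this kind are commonly written as below; the symbols $\gamma$, $\beta$, and the MLP parameterization are illustrative assumptions, not the exact block definitions from the thesis.

\[
h' \;=\; \gamma(e)\odot h \;+\; \beta(e), \qquad \gamma(e)=\mathrm{MLP}_{\gamma}(e),\;\; \beta(e)=\mathrm{MLP}_{\beta}(e),
\]

where $h$ denotes intermediate image features and $e$ denotes a sentence, attribute-word, or word embedding used as the conditioning signal.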
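For the contrastive loss in contribution (1), an InfoNCE-style objective over matched and mismatched image-text pairs within a batch is a typical formulation; the temperature $\tau$ and the similarity score $s(\cdot,\cdot)$ below are assumptions for illustration rather than the thesis's exact loss.

\[
\mathcal{L}_{\mathrm{con}} \;=\; -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\big(s(v_i, t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(s(v_i, t_j)/\tau\big)},
\]

where $v_i$ and $t_i$ are the image and text features of the $i$-th matched pair, $s(\cdot,\cdot)$ is cosine similarity, and mismatched pairs $(v_i, t_j)$, $j\neq i$, serve as negatives.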
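For the CLIP similarity term in contribution (2), one plausible (purely illustrative) way such a term augments the generator objective is to add the CLIP image-text cosine similarity, weighted by a coefficient $\lambda$:

\[
\mathcal{L}_{G} \;=\; -\,\mathbb{E}_{z\sim p(z)}\big[D\big(G(z,t),\,t\big)\big]\;-\;\lambda\,\mathbb{E}_{z\sim p(z)}\Big[\cos\big(E^{\mathrm{CLIP}}_{\mathrm{img}}\big(G(z,t)\big),\,E^{\mathrm{CLIP}}_{\mathrm{txt}}(t)\big)\Big],
\]

where $G$ and $D$ are the generator and discriminator, $t$ is the text condition, and $E^{\mathrm{CLIP}}_{\mathrm{img}}$, $E^{\mathrm{CLIP}}_{\mathrm{txt}}$ denote the CLIP image and text encoders; the exact weighting and encoder fine-tuning scheme are as defined in the thesis.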