Cross-modal tasks were proposed to give computers the ability to perceive more information about the world, thereby increasing their understanding and knowledge of it; text-to-image generation, the topic of this work, belongs to this family. The task maps natural language to vision, giving a machine the ability to generate images that correspond to textual descriptions. Although considerable progress has been made, the instability of generative models and the semantic complexity of text descriptions keep the task challenging, and several issues deserve further research: (1) the object frame tends to deviate or collapse during training, making subsequent refinement impossible; (2) non-target regions of the generated images are influenced by the text; (3) the background of the generated image is often monotonous and blurred. To address these problems, the paper proposes the following approaches.

To address the tendency of the target object to deviate during generation, the paper proposes a Class-Aware skeleton Consistency Generative Adversarial Network (CAC-GAN). Using image classification and metric learning methods, CAC-GAN first obtains class-aware features from prior knowledge; these serve as additional supervision that keeps the image stable throughout the generation process. To evaluate the integrity of generated images, the paper introduces a new metric, CACloss, which measures integrity by computing the class-aware feature distance between the generated distribution and the true distribution. CAC-GAN achieves good results on both the CUB and Oxford-102 datasets, verifying that the method can improve image integrity. However, results on the COCO dataset were poor, and we also explore the limitations of the method.

To address the problem of non-target objects being influenced by text in the generated images, the
paper proposes a Multilevel-Aware Consistency Generative Adversarial Network (MAC-GAN). At the entity level, we build a text-to-image-to-label structure to strengthen the alignment of text-image pairs; at the feature level, we use the CLIP pre-trained model to align the features of text-image pairs. To better evaluate text-image consistency, we introduce a more interpretable consistency metric, the F1-score, based on image multi-label classification. Results on the CUB, Oxford-102, and COCO datasets show that this multilevel alignment improves the correspondence between text and images and reduces the influence of text on non-target regions.

Some current generative models produce images whose background is insufficiently realistic, or too monotonous and blurred. Naive batch normalization operates at the level of batch samples, but the background varies greatly between samples, so backgrounds become blurred and averaged out. To alleviate this problem, the paper proposes a Dual Conditional Instance Normalization Generative Adversarial Network (DCIN-GAN). We use the sentence-level and phrase-level representations of the text as two conditions for image generation, and design a deep fusion convolution module based on instance normalization to build a single-stage generative adversarial network. Comprehensive experiments on two widely used datasets, CUB and Oxford-102, show that DCIN-GAN improves background quality and increases the diversity of the generated images' backgrounds.
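The dual-conditioning idea above can be sketched as an instance normalization layer whose per-channel scale and shift are predicted from the two text representations, so each sample's background statistics are modulated individually rather than averaged over the batch. This is an illustrative reconstruction under assumed dimensions and module names, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class DualConditionalInstanceNorm(nn.Module):
    """Instance normalization whose affine parameters are predicted from
    two text conditions (sentence-level and phrase-level embeddings).
    Hypothetical sketch; layer shapes are illustrative assumptions."""

    def __init__(self, num_channels: int, sent_dim: int, phrase_dim: int):
        super().__init__()
        # Normalize each sample's feature map independently (no learned affine).
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # Each condition predicts a per-channel scale (gamma) and shift (beta).
        self.sent_affine = nn.Linear(sent_dim, num_channels * 2)
        self.phrase_affine = nn.Linear(phrase_dim, num_channels * 2)

    def forward(self, x, sent_emb, phrase_emb):
        h = self.norm(x)
        gamma_s, beta_s = self.sent_affine(sent_emb).chunk(2, dim=1)
        gamma_p, beta_p = self.phrase_affine(phrase_emb).chunk(2, dim=1)
        # Combine both conditions; broadcast (N, C) -> (N, C, 1, 1).
        gamma = (1 + gamma_s + gamma_p).unsqueeze(-1).unsqueeze(-1)
        beta = (beta_s + beta_p).unsqueeze(-1).unsqueeze(-1)
        return gamma * h + beta

# Usage: a batch of 4 feature maps modulated by sentence and phrase embeddings.
layer = DualConditionalInstanceNorm(num_channels=64, sent_dim=256, phrase_dim=128)
x = torch.randn(4, 64, 8, 8)
out = layer(x, torch.randn(4, 256), torch.randn(4, 128))
print(out.shape)  # torch.Size([4, 64, 8, 8])
```

Because the statistics are computed per sample rather than per batch, background styles from different samples are not mixed, which is the intuition behind preferring instance normalization over batch normalization here.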