Breakthroughs in deep learning have greatly advanced computer vision applications and image processing techniques, and with the continuous improvement of hardware capabilities, the research field of image generation has also developed rapidly. Researchers have begun to build systems that understand the relationship between vision and language and create images that reflect the meaning of textual descriptions. Text-to-image generation can be understood as the task of generating realistic images that align with a given textual description. Existing text-to-image generation methods mostly focus on improving the quality of generated images or enhancing the alignment between images and text, while neglecting the controllability needed to generate interactively editable images. In addition, because the training sample categories of text-to-image generation models are unevenly distributed, text descriptions cannot be matched effectively when generating images for categories with few training samples. To address insufficient controllability in image generation and the low image quality caused by sparse training sample categories, a Layout Net framework is proposed that uses image layouts for text-to-image generation. The main contributions of this study are as follows:

1. To create layout images that are more reasonable and better aligned with user requirements, a text-to-layout-image generation component based on an improved DF-GAN is employed. The text is first preprocessed and entities are extracted using a large language model; the text and the entity list are then encoded into feature vectors and fed into the generator to produce layout images.

2. To lower the fine-tuning threshold of latent diffusion models, a diffusion model is employed as the backbone network, and a conditional mixing module is incorporated to integrate the layout-image information. A new U-Net architecture is constructed by copying the weights of the diffusion model's encoder, allowing layout-image information to be injected gradually into the latent diffusion model without modifying its original weights.

3. Because of the uneven distribution of model training data, it is difficult to obtain images that match the text description in generation tasks with few samples in the training dataset (e.g., specific individuals and locations). A method is therefore proposed that enhances image generation with real-world objects: entities are extracted from the text, corresponding object images are retrieved from a local material library or the internet, and these object images are stitched into the layout image as inputs to the model, constraining the features of the generated image.

The proposed model is evaluated on two benchmark datasets, CUB-200-2011 and MS-COCO, and demonstrates superior visual performance compared with existing approaches. The model also achieves better controllability, as generated images can be edited through manual adjustment of the layout images. Additionally, for generation tasks with few samples in the training dataset, the proposed Layout Net-based enhancement method achieves effects similar to model fine-tuning without requiring additional training.
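The text-and-entity conditioning pipeline of contribution 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the LLM entity extractor is replaced by a trivial capitalized-word heuristic, and the learned text encoder by a deterministic hash-based embedding; `EMBED_DIM`, `extract_entities`, and `encode_condition` are all hypothetical names.

```python
import hashlib
import numpy as np

EMBED_DIM = 64  # illustrative embedding size, not taken from the paper


def embed(token: str, dim: int = EMBED_DIM) -> np.ndarray:
    """Toy deterministic embedding: hash the token into a fixed-size vector.
    A real system would use a learned text encoder."""
    seed = int.from_bytes(hashlib.sha256(token.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)


def extract_entities(text: str) -> list:
    """Stand-in for the LLM-based entity extraction described above:
    here we simply keep capitalized words."""
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]


def encode_condition(text: str, entities: list) -> np.ndarray:
    """Encode the text and the entity list into one conditioning vector,
    which (together with noise) would be fed to the layout generator."""
    text_vec = np.mean([embed(t) for t in text.lower().split()], axis=0)
    ent_vec = (np.mean([embed(e.lower()) for e in entities], axis=0)
               if entities else np.zeros(EMBED_DIM))
    return np.concatenate([text_vec, ent_vec])
```

In the actual framework this conditioning vector would condition the improved DF-GAN generator; the sketch only shows how text and entity features can be combined into a single input.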
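The weight-copying scheme of contribution 2 can be sketched with a single linear "encoder block". This is a simplified illustration in the spirit of ControlNet-style conditioning, not the paper's conditional mixing module: the frozen encoder weights are duplicated into a trainable branch that receives the layout conditioning, and the branch re-enters the frozen path through a zero-initialized projection (the zero initialization is an assumption here), so at initialization the original model's behavior is unchanged.

```python
import numpy as np


class ControlBranch:
    """Minimal sketch of conditioning a frozen network via a trainable
    copy of its encoder weights. Shapes and the one-layer 'encoder'
    are illustrative only."""

    def __init__(self, frozen_weight: np.ndarray):
        self.frozen = frozen_weight        # original weights, never updated
        self.copy = frozen_weight.copy()   # trainable duplicate of the encoder
        d = frozen_weight.shape[0]
        self.zero_proj = np.zeros((d, d))  # zero-initialized connection

    def forward(self, x: np.ndarray, layout_cond: np.ndarray) -> np.ndarray:
        frozen_out = self.frozen @ x
        # The trainable copy sees the input plus the layout conditioning.
        branch_out = self.copy @ (x + layout_cond)
        # Zero projection => the branch contributes nothing at initialization,
        # so the frozen model's output is exactly preserved.
        return frozen_out + self.zero_proj @ branch_out
```

Only `copy` and `zero_proj` would be trained; as `zero_proj` moves away from zero during training, layout information flows gradually into the frozen model, matching the "without modifying its original weights" property described above.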
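The object-stitching step of contribution 3 amounts to pasting retrieved object images into the layout image at their layout positions. A minimal sketch, assuming images are NumPy arrays and each object comes with a `(top, left)` placement (the pair format and the `stitch_objects` name are assumptions; retrieval from the material library or the web is taken as already done):

```python
import numpy as np


def stitch_objects(layout: np.ndarray, objects) -> np.ndarray:
    """Paste each retrieved object image into a copy of the layout image.

    `objects` is a list of (image, (top, left)) pairs. The stitched result
    serves as the model input that constrains the generated image features.
    """
    canvas = layout.copy()
    for img, (top, left) in objects:
        h, w = img.shape[:2]
        canvas[top:top + h, left:left + w] = img
    return canvas
```

A real pipeline would also need to resize each object to its layout box and handle overlaps; the sketch keeps only the core paste operation.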