In recent years, text-to-image generation has become a research focus in deep learning. The task takes as input a sentence describing the attributes of a target and feeds it to a network model, which learns through extensive training to generate a semantically consistent image. The generated images are usually required to be as realistic and diverse as possible while remaining semantically consistent with the input text. Researchers have tended to adopt generative adversarial networks (GANs) as the basic architecture. Although images generated by GANs can satisfy the basic requirements for quality and realism, training is often difficult in text-to-image generation due to limitations of the network structure itself, namely the imbalance between the generator and the discriminator. Moreover, to make the generated images more realistic and semantically consistent, a model may introduce a large number of parameters, resulting in a more complex network structure. The task therefore remains challenging. This thesis addresses these problems with the following work.

To stabilize the training process of generative adversarial networks, the thesis proposes a conditional adversarial network based on perceptual pyramid fusion and feature matching. The overall model adopts a pyramidal triangular network structure to fuse multi-scale content. During training, the generator is supervised with a perceptual loss and a feature-matching loss: the former enhances the details of the generated images, while the latter compensates for the limited feedback provided by the discriminator's binary real/fake output. Experimental results show that this architecture not only stabilizes the training process but also improves the
overall performance of the model.

To address the increase in network complexity caused by introducing a large number of parameters, the thesis proposes a lightweight generative adversarial network based on a non-local attention mechanism. The model embeds a non-local self-attention block in the network to capture global semantic information and detailed features, and uses this information in a step-by-step encoding process to generate a more reasonable final image. Experimental results show that the GAN framework incorporating the non-local attention mechanism generates reasonable and realistic images while reducing the parameter count and computational cost of the whole model.

The proposed methods are validated on the publicly available COCO-Stuff dataset, using metrics such as Inception Score (IS), Fréchet Inception Distance (FID), and classification accuracy score to evaluate the realism and diversity of the generated images. The experimental results show that the quality of the generated images is improved over previously proposed methods.
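For illustration, the perceptual and feature-matching losses described above can be sketched as follows. This is a minimal NumPy sketch under assumptions, not the thesis implementation: `feat_real` and `feat_fake` stand in for feature maps taken from a fixed pretrained feature extractor (e.g. a VGG network), and the lists of discriminator features are hypothetical placeholders.

```python
import numpy as np

def perceptual_loss(feat_real, feat_fake):
    """Mean squared distance between feature maps of real and generated
    images, extracted by a fixed pretrained network (assumed given)."""
    return float(np.mean((feat_real - feat_fake) ** 2))

def feature_matching_loss(feats_real, feats_fake):
    """Mean absolute distance between the discriminator's intermediate
    feature maps for real and generated images, averaged over layers.
    This gives the generator a richer training signal than the single
    binary real/fake output."""
    return float(np.mean([np.mean(np.abs(r - f))
                          for r, f in zip(feats_real, feats_fake)]))

# Toy usage with random arrays standing in for feature maps
rng = np.random.default_rng(0)
fr = rng.normal(size=(8, 8))
ff = rng.normal(size=(8, 8))
print(perceptual_loss(fr, fr))       # identical features -> 0.0
print(perceptual_loss(fr, ff) > 0)   # differing features -> positive loss
```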
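The non-local self-attention block mentioned above can be sketched in a few lines of NumPy. This is a simplified illustration, not the thesis model: the learned 1x1 projection convolutions of a full non-local block are replaced by identity maps, so only the global position-to-position aggregation is shown.

```python
import numpy as np

def non_local_block(x):
    """Non-local self-attention over flattened spatial positions.
    x: array of shape (N, C) -- N spatial positions, C channels.
    Each output position aggregates information from all positions,
    capturing the global semantic context described in the text.
    The learned projections (theta, phi, g) are omitted for brevity."""
    affinity = x @ x.T                               # (N, N) pairwise similarities
    affinity -= affinity.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(affinity)
    attn /= attn.sum(axis=1, keepdims=True)          # softmax over positions
    y = attn @ x                                     # global aggregation
    return x + y                                     # residual connection

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 4))     # 16 positions, 4 channels
out = non_local_block(x)
print(out.shape)                 # same shape as the input: (16, 4)
```

Because the residual connection preserves the input shape, such a block can be dropped into an existing generator without changing the surrounding layer dimensions, which is one reason it suits a lightweight design.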
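As a concrete reference for the evaluation, the Inception Score is defined as exp(E_x KL(p(y|x) || p(y))); a minimal NumPy sketch is below. In practice the class probabilities come from a pretrained Inception classifier; the array here is a toy stand-in.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) class probabilities p(y|x) for N generated images.
    Returns exp of the mean KL divergence between each conditional
    distribution p(y|x) and the marginal p(y). Higher scores indicate
    images that are both confidently classifiable and diverse."""
    p_y = probs.mean(axis=0, keepdims=True)   # marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# A degenerate generator whose classifier output is always uniform
uniform = np.full((4, 5), 0.2)
print(inception_score(uniform))   # -> 1.0, the lowest possible score
```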