In the era of big data,the way to get image information is mainly to search in existing images.But due to the complexity of the content of images,it is difficult to find the information that is really needed.In order to the computer automatically generates meaningful images according to requirements,the task of text to image has attracted people's attention.Text to image is a cross-modal task.When the input is a sentence or a piece of text,the image conforming to the semantic information of the text is output.This task not only requires the computer to understand the semantic information of the text,but also converts the semantic information into pixels.It's a very challenging task.However,with the rapid development of deep learning,especially the generative adversarial networks,many methods have emerged to improve the quality of text to images.However,due to the complexity of this task,the quality of the generated image needs further improvement.In order to improve the quality of text to images further,the works of this paper are as follows:(1)The model of text to image combined with category and reconstruction information.In the task of text to image,in order to improve the quality of the generated images,this paper proposes the generative adversarial networks models combined category and reconstruction information.The model is trained in two stages.We can generate low-resolution images in the first stage,then,we synthesize high-resolution images in the second stage.In each stage of the text to image,category information is added at the end of the discriminator,and the reconstruction information of the pixels and features is added to the loss of the generator to assist the training of model.Experimented on the Oxford Flowers、Caltech UCSD Birds and MS COCO datasets,the experiments show that the model proposed effectively improves the quality of the text to image.The color and details of the images generated are more delicate.(2)The model of complex scene text to image combined with visual dialog.In the task of text to image in complex scenes,the text description can't contain most of the details of the image,and it is difficult to generate high-quality images.Therefore,the paper proposes a model to improve the quality of text to image in complex scenes.In order to pay attention to the correlation between visual dialog and the local area of the corresponding images,the visual-semantic embedding model based on attention is constructed to obtain the feature representation of visual dialog.Then,the feature is spliced with the text description feature obtained by the description encoder,and the image corresponding to the text is generated by the generative adversarial networks models combined visual-semantic embedding and category reconstruction.Experiments were conducted on the MS COCO dataset,the experimental results show that the images generated after adding the visual dialog are more clear,the colors and details are more abundant,which effectively improves the quality of the text to image. |