In the field of Artificial Intelligence, multi-modality tasks are generally more complicated than single-modality tasks. Hence, many models involving multiple modalities have not yet met the standard for wide application. Text-to-image generation is a typical multi-modality task: it requires that the images generated by the model remain semantically consistent with their textual descriptions. Because of its difficulty and potential applicability, this task has attracted broad interest. Since Generative Adversarial Networks (GANs) improve the realism of generated images, many GAN-based text-to-image generation methods have been proposed. This paper identifies two defects of current text-to-image models: (1) the traditional text-image feature fusion layer focuses only on word features and pixel-level features in the image feature maps, and does not explore the relations between words and region-level features; (2) although deep convolutional networks can spread text-refined features over large areas, this process struggles to capture complicated geometric structures. In short, conventional GAN-based text-to-image generation methods process features in a locally constrained way, ignoring the power of long-range dependencies. To address these defects, this paper proposes a novel region-wise attention-based fusion layer, dubbed Region-wise Attention (RWAttn). This module enables long-range dependency modeling and region-level refinement of feature maps for the text-to-image model, which mitigates the defects of current methods. The model regionalizes the intermediate feature maps according to the semantic relations between feature points and semantic entities, and then refines the intermediate feature maps based on the region features. In this way, the model relieves the convolutional networks from building long-range dependency relations, and RWAttn can be viewed as a supplement to the original short-range-focused generator. As an additional contribution, this paper identifies a defect of the widely used loss for text-to-image generation, the Deep Attentional Multimodal Similarity Model (DAMSM) loss: because the distribution of the generated images is far from the distribution of the real images, DAMSM is not efficient enough to optimize the generation process. In this paper, we calculate the posterior probability of the text by incorporating real images from the dataset. In this way, our model distinguishes the generated images from the real images, which forces the generator to synthesize images that are semantically consistent with the candidate captions. Experiments on two widely used benchmark datasets (CUB and COCO) demonstrate the efficacy of our proposed model, which achieves compelling results against state-of-the-art methods.
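To make the region-wise fusion idea concrete, the following is a minimal PyTorch sketch of such a layer, not the exact RWAttn implementation; the tensor shapes, the soft assignment, and the mean pooling scheme are all assumptions made for illustration. Each spatial location is softly assigned to a word (semantic entity), the locations assigned to each word are pooled into a region feature, and the region context is broadcast back to refine the feature map, which is the long-range, text-conditioned refinement described above.

```python
# A minimal sketch of a region-wise attention fusion layer (illustrative only).
import torch
import torch.nn as nn


class RegionWiseAttention(nn.Module):
    """Fuse word features with an image feature map at the region level."""

    def __init__(self, word_dim: int, channels: int):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, channels)      # map words into image feature space
        self.refine = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # feat:  (B, C, H, W) intermediate image feature map
        # words: (B, T, word_dim) word embeddings from the text encoder
        b, c, h, w = feat.shape
        words = self.word_proj(words)                        # (B, T, C)
        pixels = feat.flatten(2).transpose(1, 2)             # (B, H*W, C)

        # Soft assignment of every feature point to a semantic entity (word).
        assign = torch.softmax(pixels @ words.transpose(1, 2), dim=-1)      # (B, H*W, T)

        # Region features: mean of the feature points assigned to each word.
        region = assign.transpose(1, 2) @ pixels                             # (B, T, C)
        region = region / (assign.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)

        # Broadcast region context back to every location, giving each point
        # access to long-range, text-conditioned information.
        context = (assign @ region).transpose(1, 2).reshape(b, c, h, w)
        return self.refine(torch.cat([feat, context], dim=1))
```

Under these assumptions, a generator block would call the layer on each intermediate feature map, e.g. `feat = rwattn(feat, word_embs)`, leaving the surrounding convolutions to handle local structure.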
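The modified matching objective can be sketched in a similar spirit. The snippet below is one plausible reading of "incorporating real images from the dataset" into the text posterior; the encoders producing the global features, the cosine-similarity scoring, and the candidate-set construction are illustrative assumptions rather than the paper's exact replacement for the DAMSM loss.

```python
# A hedged sketch of an image-conditioned text posterior loss (illustrative only).
import torch
import torch.nn.functional as F


def text_posterior_loss(fake_img_feat: torch.Tensor,
                        real_img_feats: torch.Tensor,
                        text_feat: torch.Tensor,
                        gamma: float = 10.0) -> torch.Tensor:
    """Encourage each caption to match its generated image against real candidates.

    fake_img_feat:  (B, D) global features of the generated images
    real_img_feats: (B, D) global features of real images from the dataset
    text_feat:      (B, D) global features of the captions
    """
    batch = fake_img_feat.size(0)

    # Candidate set for each caption: its generated image plus the real images.
    candidates = torch.cat(
        [fake_img_feat.unsqueeze(1),                                  # (B, 1, D)
         real_img_feats.unsqueeze(0).expand(batch, -1, -1)],          # (B, B, D)
        dim=1)                                                        # (B, 1+B, D)

    # Similarity between each caption and every candidate image.
    sim = gamma * F.cosine_similarity(text_feat.unsqueeze(1), candidates, dim=-1)  # (B, 1+B)

    # Posterior of the caption selecting its own generated image (index 0)
    # among a set that includes real images, pushing generated images toward
    # the real-image distribution while staying faithful to the caption.
    target = torch.zeros(batch, dtype=torch.long, device=sim.device)
    return F.cross_entropy(sim, target)
```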