Research And Application Of Text-to-Image Technology Based On Multi-modal Pre-training

Posted on: 2023-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: B Yang
GTID: 2568306914971699
Subject: Intelligent Science and Technology
Abstract/Summary:
Text-to-Image generation means having a computer generate an image that matches a given natural-language description. The technology is in wide demand in industries such as cultural content and creative design, and it has received extensive research attention. However, current evaluation methods cannot accurately assess the overall performance of Text-to-Image models, and existing Text-to-Image techniques struggle to guarantee both the quality of the generated images and their consistency with the text descriptions. To address this, this thesis studies three aspects of Text-to-Image technology and achieves the following results.

Firstly, the evaluation metrics currently used for the Text-to-Image task are inaccurate and cannot measure "Image Quality" and "Image-Text Consistency" at the same time. This thesis therefore proposes a cross-modal distance (CMD) for evaluating Text-to-Image models. CMD relies on a third-party model with strong consistency modeling ability to perform the evaluation. It maps the features of the generated image, the original image, and the text description into a shared semantic space, and then evaluates both the similarity between the generated image and the original image and the mapping deviation of the image relative to the text. Experiments show that CMD effectively avoids the bias of existing evaluation metrics and can evaluate "Image Quality" and "Image-Text Consistency" simultaneously.

Secondly, existing Text-to-Image models have difficulty ensuring that the generated images conform to the text descriptions. This thesis proposes an efficient Image-Text consistency model, the image-text matcher (ITM), and builds the ITM-GAN model on top of it. The ITM module is built on a large-scale multimodal pre-trained model and can accurately evaluate the similarity between images and texts. Experimental results on public data show that ITM-GAN achieves significant performance improvements over previous models. The thesis also trains the generative model with several consistency models of differing performance. An in-depth analysis of the results shows that "Image Quality" and "Image-Text Consistency" in the Text-to-Image task are related to each other, and that the two need to be balanced during training for the model to train efficiently.

Thirdly, to address the poor quality of images generated by existing Text-to-Image models, this thesis proposes a gradual refinement generator (GRG) that progressively refines the language constraints. Within a multi-level generative network, this structure applies linguistic information step by step, from coarse-grained sentences to fine-grained words, to constrain image generation. Combined with ITM, it forms the GR-GAN model. Experimental results on public data show that GR-GAN outperforms the current state-of-the-art generative models and achieves better "Image Quality" and "Image-Text Consistency" at the same time. Further experimental analysis also shows that using semantic information from coarse to fine and optimizing the generated images step by step is a more reasonable way to construct the generator.

Finally, building on the research above, this thesis carries out an extensive visual experimental analysis. Using models such as ITM-GAN and GR-GAN, it automatically generates a batch of high-quality images that conform to the text descriptions from a series of text inputs. On this basis, a qualitative analysis is made of the "Image Quality" and "Image-Text Consistency" of the images generated by the Text-to-Image models.
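The cross-modal distance (CMD) idea described in the abstract can be illustrated with a small sketch. The abstract does not give the CMD formula, so everything below is an assumption made for illustration only: the function names, the use of cosine distance, the choice of the generated image's distance to the text as the deviation term, and the `weight` parameter are hypothetical and are not the thesis's definition. The embedding vectors are assumed to come from some pre-trained image-text model that maps both modalities into one space.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


def cross_modal_distance_sketch(
    generated_image: Vector,
    original_image: Vector,
    text: Vector,
    weight: float = 0.5,
) -> float:
    """Hypothetical CMD-style score (lower is better).

    All three inputs are assumed to be embeddings in one shared semantic
    space. The weighted sum is an illustrative choice, not the formula
    used in the thesis.
    """
    # How far the generated image is from the original image
    # (a stand-in for "Image Quality").
    quality_term = cosine_distance(generated_image, original_image)
    # How far the generated image is from the text description
    # (a stand-in for "Image-Text Consistency").
    consistency_term = cosine_distance(generated_image, text)
    return weight * quality_term + (1.0 - weight) * consistency_term


if __name__ == "__main__":
    # Toy 2-D embeddings; real embeddings would have hundreds of dimensions.
    score = cross_modal_distance_sketch([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
    print(score)
```

In this toy setup a single number reflects both criteria: a generated image that matches the text but differs from the original image is still penalized through the first term, which is the kind of joint evaluation the abstract attributes to CMD.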
Keywords/Search Tags: Text-to-Image generation, generative adversarial networks, multimodal, Image-Text Consistency modeling