
Deep Learning Based Conditional Visual Content Generation And Application

Posted on: 2023-09-01    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J N Zhang
GTID: 1528306833498464    Subject: Electronic information
Abstract/Summary:
With the rapid development of deep learning and the continuous improvement of computing power, conditional visual content generation has advanced quickly and produced many notable results, with significant application value in popular fields such as pan-entertainment, the metaverse, and virtual humans. However, a gap remains between current technology and the requirements of high-standard applications, so the efficiency and ease of use of generation models need further improvement. In addition, generating high-quality and reasonable visual content from given conditional inputs (e.g., image, audio, and motion information) remains a challenging open problem. This thesis focuses on deep learning-based conditional visual content generation and studies specific problems at different levels, from shallow to deep, following the principle of progressively deeper understanding of information. On the one hand, low pixel-level tasks focus on local structure perception and distribution prediction; the biggest challenge is generating reasonable, high-quality images while designing models efficient enough to support various terminal applications. On the other hand, high semantic-level tasks must understand the semantic information in images, which poses significant challenges for high-resolution image generation, high-precision model design, and the optimization procedure. These difficulties become more prominent in the transition to video generation, where temporal rationality and generation diversity must also be modeled. Given these challenges, this thesis studies the typical image colorization and super-resolution tasks, which generate higher-quality visual content from low-level perception, as well as the semantic face swapping, face animation, and image dynamic tasks, which controllably generate richer images and videos from high-level understanding. The research contents and main contributions of this thesis are as follows:

1. Starting from the local structure perception of image translation, this thesis studies the image colorization and super-resolution problems and proposes an end-to-end framework that solves both simultaneously and efficiently. Specifically, a novel pyramid valve cross attention module is designed to support both automatic and referential colorization; it better understands and aggregates the color information of reference images and is also highly interpretable. In addition, a continuous pixel mapping module is proposed to meet the application requirement of arbitrary image magnification, improving model accuracy with less computation.

2. Starting from multi-condition constrained texture transfer, this thesis studies the image-level face swapping problem and proposes a novel region-aware face swapping method for more delicate modeling, which includes a facial region-aware branch and a source feature-adaptive branch. The former effectively models non-overlapping multi-scale facial semantic interactions by introducing a global attention mechanism, while the latter complements global identity-related cues to further ensure identity consistency in the generated image. In addition, this study proposes an unsupervised facial mask prediction module to further improve the accuracy and practicability of the model.

3. Starting from multi-condition constrained geometry editing, this thesis studies the image-level face animation problem and proposes a multi-identity face animation model that follows the idea of decoupling face geometry and texture information. The model consists of a well-designed face landmark converter branch, which transfers facial movements between different identities in geometric space, and a geometry-aware generator branch, which generates the animated face images, achieving multi-identity face animation while preserving generation quality. The framework is also extended to the audio-guided multi-identity face animation task, for which an audio feature fuser module and a geometry controller module are designed for efficient audio feature extraction and injection, respectively. In addition, a high-quality Ann VI dataset is proposed to support research on high-resolution audio-guided multi-identity face animation.

4. Starting from motion-constrained image sequence generation, this thesis studies the image dynamic problem and designs an end-to-end dynamic video generation framework based on the idea of decoupling motion and texture information. In detail, the approach consists of an optical flow encoder and a dynamic video generator. The former encodes the optical flow information that represents video motion into a normalized vector, which makes inference easy: various videos can be generated by randomly sampling the motion vector. The latter generates a reasonable target dynamic video from a single input image under the control of the motion vector. In addition, given the poor quality of current time-lapse video datasets, this study proposes a large-scale, high-resolution QST dataset to support ongoing research on this task.

Based on the above research contents and achievements, this thesis conducts extensive experimental evaluations on several mainstream datasets, and the results demonstrate the effectiveness and superiority of the proposed methods. The thesis achieves strong research results in deep learning-based conditional visual content generation, and some of the studied models have been used in commercial products, where they offer considerable application value.
Keywords/Search Tags: Computer Vision, Deep Learning, Conditional Visual Content Generation, Image Colorization, Image Super-Resolution, Face Swapping, Face Animation, Image Dynamic