In daily life, 3D models play an increasingly important role in how people display and interact with everyday content. In many cases, however, it is difficult to obtain accurately captured images of the human head because of scene interference and the distance and angle between the subject and the camera. It is therefore of great significance to study how to generate 3D head models that are unaffected by environmental lighting from simple face mask images.

With the continuous advancement of computer vision, attention-based models (Transformers), which play a central role in natural language processing, have also been widely adopted for visual semantic feature extraction. By encoding semantic mask images into latent variables with a Transformer and mapping them into the latent code of the desired scene, multiple fully connected and embedding layers can be used to capture the correlations among semantic information. This allows the spatiotemporal features of the mask image and the 2D image to be combined, strengthens the semantic dependencies in the network structure, and yields generated structures with finer detail.

In this paper, we address these issues by designing a semantic information module with an encoder-decoder structure. In the encoder, a custom mapping network maps the input noise vector z to a high-dimensional space, and a Swin Transformer encodes the semantic mask into the 3D generative model to be trained. In the forward pass of the 3D generative model, the multi-resolution input image is partitioned into sub-blocks, which processes multi-resolution images effectively while balancing computational efficiency and model performance. The spatial encoding sampled at spatial points is concatenated with the low-level semantic features in the decoder, and the desired implicit 3D spatial representation is obtained by aggregating the hidden
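The encoder-decoder pipeline described above (mapping network for the noise vector, positional encoding of sampled spatial points, concatenation with a latent code, and aggregation through an MLP) can be sketched minimally as follows. All layer sizes, dimensions, and function names here are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    # Apply a stack of fully connected layers with ReLU between them.
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def make_weights(dims):
    # Randomly initialized weights; dims are hypothetical.
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def positional_encoding(p, n_freqs=4):
    # NeRF-style sin/cos spatial encoding of sampled 3D points.
    feats = [p]
    for k in range(n_freqs):
        feats.append(np.sin((2 ** k) * np.pi * p))
        feats.append(np.cos((2 ** k) * np.pi * p))
    return np.concatenate(feats, axis=-1)

# Mapping network: noise vector z -> high-dimensional latent code w.
map_net = make_weights([64, 128, 128])
z = rng.standard_normal((1, 64))
w = mlp(z, map_net)                      # (1, 128)

# Decoder: concatenate the spatial encoding of sampled points with the
# broadcast latent code, then aggregate through an MLP into a per-point
# implicit representation (here: density + RGB, 4 values).
pts = rng.standard_normal((256, 3))      # sampled spatial points
enc = positional_encoding(pts)           # (256, 27)
h = np.concatenate([enc, np.repeat(w, len(pts), axis=0)], axis=-1)
dec_net = make_weights([h.shape[-1], 128, 4])
out = mlp(h, dec_net)
print(out.shape)  # (256, 4)
```

In practice the latent code would come from the Swin Transformer mask encoder rather than random weights; the sketch only shows how the pieces connect.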
encoding through an MLP.

To validate this neural radiance field technique based on generative adversarial networks, experiments were conducted on the CelebAMask-HQ and CatMask datasets. The results show that the proposed Swin Transformer-based implicit neural representation model achieves FID scores of 40.6 and 24.1 and IS scores of 2.15 and 2.50 on CelebAMask-HQ and CatMask, respectively, a significant improvement.

Since our need for head models goes beyond simply viewing head images, the head models must be parameterized. By encoding facial expression, hairstyle, clothing, and pose separately and injecting the desired control codes into the generation process, high-quality manipulation of the head model can be achieved. To this end, this paper uses a 3DMM to extract facial information elements during training, combines the Swin Transformer introduced in Chapter 3 for feature fusion of the input face images, and applies multiple facial encoders to transform the input images into multiple domains representing fine head details. Finally, adversarial training with a generative adversarial network yields realistic and controllable head models. A 3D viewer application is also implemented, enabling real-time rendering and manipulation of head dynamics.
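The attribute-wise control scheme (separate codes for expression, hairstyle, clothing, and pose, combined into the generator's conditioning input) can be illustrated as below. The attribute names, code dimensions, and helper functions are hypothetical and only demonstrate why separate encodings permit isolated edits:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-attribute latent dimensions.
ATTR_DIMS = {"expression": 16, "hairstyle": 8, "clothing": 8, "pose": 6}

def encode_attributes(codes):
    # Concatenate attribute codes in a fixed order so the generator
    # always receives the same conditioning layout.
    return np.concatenate([codes[k] for k in ATTR_DIMS])

def edit_attribute(codes, name, new_code):
    # Swap one attribute code while leaving the others untouched,
    # which is what enables isolated control of, e.g., expression.
    edited = dict(codes)
    edited[name] = new_code
    return edited

codes = {k: rng.standard_normal(d) for k, d in ATTR_DIMS.items()}
cond = encode_attributes(codes)
print(cond.shape)  # (38,)

# Replacing only the expression code changes only that slice of the
# conditioning vector; hairstyle, clothing, and pose are unaffected.
smile = rng.standard_normal(ATTR_DIMS["expression"])
cond2 = encode_attributes(edit_attribute(codes, "expression", smile))
assert not np.allclose(cond[:16], cond2[:16])
assert np.allclose(cond[16:], cond2[16:])
```

In the full model the individual codes would be produced by the facial encoders (with the 3DMM supplying facial information elements during training) rather than sampled randomly.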