In the rapidly evolving field of artificial intelligence, generative AI, especially text-based large language models, has become one of the most prominent research hotspots. These models demonstrate significant generalizability and intelligence across multiple research areas and practical applications, marking a major leap in AI. At the same time, visual generation, centered on image generation and encompassing modalities such as video and 3D models, has become a key tool in artistic creation, media and entertainment, virtual reality, and other areas of scientific research. These technologies not only greatly accelerate creative and design processes but also open new avenues for personalized content creation. Against this backdrop, research on spatiotemporal information inference and generation algorithms is especially crucial. Such algorithms aim to infer dynamic information from static images or to understand temporal changes from a series of images, significantly enhancing the intelligence and adaptability of visual generation technologies and pushing the field toward higher levels of development.

The information content of images alone often falls short of today's increasingly complex task requirements. To overcome this limitation, incorporating additional spatial and temporal information into image inference, thereby creating richer data, has become a key direction for generative AI in visual information processing. This approach helps models learn and understand the laws of the world and, in turn, better serve humanity. The core issues in this field fall into two categories: spatial information inference and generation, and temporal information inference and generation. Although current image generation models have made significant technical progress, becoming more powerful and precise, inferring additional spatial and temporal information on top of basic image data remains a major challenge. Existing image information typically reveals only the basic outline of a scene, making it difficult for models to infer high-quality information based solely on existing data distributions. Researching and developing reliable spatiotemporal inference and generation algorithms is therefore particularly important: models must be able to supplement spatiotemporal information stably and efficiently under diverse input image conditions. In spatial information inference, special attention should be paid to the geometric accuracy of the scene and the relative positions of objects, to ensure the spatial logic of the generated images. In temporal information inference, the focus should be on capturing the continuity and logic of dynamic changes, to generate temporally coherent and realistic image sequences. This thesis therefore focuses on spatiotemporal information inference and generation algorithms under complex conditions and achieves the following main innovative results:

1. For predicting and generating spatial information from input images that lack it, this thesis constructs a framework adaptable to various types of input images, supporting both unsupervised and supervised training. Special image types such as semantic annotation maps and hand-drawn contour maps typically contain only limited information, such as semantic categories or basic outlines, and lack spatial detail and depth; for these, this thesis develops an innovative spatial information inference and generation algorithm. On the generative side, a double-guided normalization and a label-guided spatial co-attention are proposed to help the model generate and preserve spatial information throughout the multi-level generative process, combining both global and local information. On the discriminative side, a multi-level perception discriminator is designed for detailed analysis and discrimination of the generated spatial information, addressing the
problem of incomplete feature learning under limited samples. Extensive experimental validation shows that this method significantly surpasses existing techniques in enhancing the spatial detail and realism of images.

2. For predicting and generating three-dimensional spatial information from multiple two-dimensional images that contain spatial information, this thesis designs a collaborative inference mapping algorithm that maps two-dimensional image features to three-dimensional features in space. The challenge is how to reasonably match and combine existing two-dimensional spatial information to build new three-dimensional spatial information, ultimately producing natural, noise-free three-dimensional models. To this end, this thesis analyzes the causes of the high error rate in spatial information inference and proposes a cross-view cooperative reasoning warping. By precisely matching and combining two-dimensional spatial information from different viewpoints, spatial features are distinctively mapped onto the occupied and unoccupied regions of three-dimensional space. Experimental results show that this method significantly reduces incorrect predictions of spatial information, reduces noise, and generates more accurately positioned object surfaces, and that it can be adapted to existing depth-based and non-depth-based models.

3. For predicting and generating temporal information from input images that contain spatial information, this thesis proposes an efficient self-recurrent temporal information inference network and a coherence-consistency training mode. In this part of the research, the input images already contain basic spatial information such as clearly defined objects and scenes; the goal is to build dynamic object movements and scene changes on this basis. The challenge lies in efficiently inferring temporal information that conforms to each object's law of motion given fixed spatial information, presenting a coherent and natural picture as a whole. To this end, this thesis proposes dual adversarial training and optical-flow optimization techniques, which together constitute consistency training from local to global, effectively mitigating object distortion and inconsistency between motion and stillness, and achieving coherent, realistic temporal information. Experimental results show that this method surpasses existing techniques in the accuracy and efficiency of temporal information inference, effectively generating dynamic scenes while maintaining the stability of spatial information.

4. For the extended-learning problem of temporal information generation in image generation models that already have spatial information processing capabilities, this thesis explores how to extend pre-trained image models to video generation. Image generation models trained on large-scale data already possess strong spatial information processing capabilities and can generate high-quality images; enabling such a model to further learn to infer and generate temporal information is a necessary research direction in generative AI. This thesis proposes an innovative two-stage training method that further trains a pre-trained image latent diffusion model into a video diffusion model, effectively inferring and generating temporal information while retaining the original spatial generation capabilities. The method also supports rapid migration and adaptation to different spatial domains, demonstrating strong transferability and generalizability and showcasing powerful spatiotemporal information generation and inference capabilities.
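To make the image-to-video extension in result 4 concrete: a common way to turn a pre-trained image diffusion backbone into a video model is to insert new temporal layers between the frozen spatial layers, so that stage one trains only the temporal parameters while the original spatial generation capability is preserved. The sketch below illustrates this pattern in PyTorch under simplifying assumptions; `SpatialBlock`, `TemporalAttention`, and `InflatedBlock` are illustrative stand-ins, not the thesis's actual modules.

```python
# Hedged sketch: "inflating" a pretrained image backbone into a video model by
# adding trainable temporal attention while freezing the spatial weights.
# All module names here are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialBlock(nn.Module):
    """Stand-in for one pretrained spatial layer of an image diffusion U-Net."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x):  # x: (B*T, C, H, W) -- frames processed independently
        return x + self.conv(x)

class TemporalAttention(nn.Module):
    """New trainable layer: self-attention across the time axis per spatial site."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, t):  # x: (B*T, C, H, W)
        bt, c, h, w = x.shape
        b = bt // t
        # Fold space into the batch dim and attend over T: (B*H*W, T, C).
        seq = x.view(b, t, c, h * w).permute(0, 3, 1, 2).reshape(b * h * w, t, c)
        q = self.norm(seq)
        out, _ = self.attn(q, q, q)
        seq = seq + out  # residual keeps the pretrained signal path intact
        return seq.view(b, h * w, t, c).permute(0, 2, 3, 1).reshape(bt, c, h, w)

class InflatedBlock(nn.Module):
    """Frozen spatial layer followed by a trainable temporal layer (stage one)."""
    def __init__(self, spatial, dim):
        super().__init__()
        self.spatial = spatial
        self.temporal = TemporalAttention(dim)
        for p in self.spatial.parameters():
            p.requires_grad = False  # preserve the image model's spatial capability

    def forward(self, x, t):
        return self.temporal(self.spatial(x), t)

# Demo: a 4-frame "video" flattened into the batch dimension.
b, t, c, h, w = 2, 4, 8, 16, 16
block = InflatedBlock(SpatialBlock(c), c)
out = block(torch.randn(b * t, c, h, w), t)
```

In a full model, one such temporal layer would follow each spatial block of the U-Net, and a second stage could unfreeze everything for joint fine-tuning or adapt the model to a new spatial domain.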