Video frame prediction is a significant research task in computer vision. By analyzing an input sequence of video frames, a video frame prediction algorithm models spatio-temporal features, summarizes the dynamic patterns of the input data, and infers the subsequent frames. By constructing video-frame data for different observation targets, video frame prediction algorithms can be applied in many fields to support productivity, daily life, and scientific research. For example, by constructing video frame sequences of driving scenes, changes in the road ahead and the movement trends of pedestrians can be inferred early, which helps autonomous vehicles make decisions in advance. By serializing atmospheric distribution maps into video sequences and using video frame prediction to extrapolate future atmospheric distributions, it is possible to build early-warning systems for extreme weather events such as heavy precipitation and hurricanes. Owing to this wide range of potential applications, research on video frame prediction algorithms holds significant practical and theoretical value.

In this dissertation, we study video frame prediction algorithms from several perspectives based on deep learning. Specifically, to address the short length of the observable frame sequence in prediction tasks, a progressive fusion method for spatiotemporal features is proposed. For the general video frame prediction task, the characteristics of video data are analyzed and a prediction algorithm incorporating dimension decoupling and attention mechanisms is proposed. In view of the growing trend toward lightweight deep neural networks, a lightweight video frame prediction model is analyzed and proposed. The main research content and contributions of this dissertation are as follows:

1. A progressive
fusion method for spatiotemporal features in video frame prediction. When the observable video sequence is short and few frames are available for feature extraction, a single network model has difficulty modeling spatial and temporal features simultaneously, which can lead to ambiguous prediction results in the spatial dimension. To address this, a model consisting of two sub-networks is proposed. The temporal sub-network models motion features; it preserves static features that remain unchanged over time by means of masks, thereby separating the dynamic foreground from the static background. The spatial sub-network captures spatial features and constructs the appearance of the prediction results. To integrate the two streams, a progressive fusion algorithm for spatiotemporal features is proposed. Based on the different feature information contained in the outputs of different network layers, three fusion strategies are designed: dense feature fusion, sparse feature fusion, and high-level feature fusion; the most effective strategy is identified and verified through experiments. The performance of the proposed model is evaluated on the UCF-101 and KITTI datasets. Compared with the baseline methods, the proposed method achieves the best PSNR, demonstrating its effectiveness.

2. A video frame prediction algorithm based on dimension decoupling and attention mechanisms. Existing video frame prediction algorithms mainly improve prediction quality by optimizing the internal structure of the Convolutional Long Short-Term Memory (ConvLSTM). However, such optimizations typically introduce many functional modules, substantially increasing the parameter count relative to the baseline ConvLSTM model. Moreover, analysis shows that ConvLSTM-based prediction models suffer from certain drawbacks, such as biased spatial positions and blurry appearance in the prediction results. Nevertheless, the predictions generated by such models still contain rich feature information that can be used to reconstruct the results. Based on these observations, this dissertation proposes a Multi-Attention LSTM (MA-LSTM) model built on dimension decoupling and attention mechanisms. The model integrates two modules, the Dimensionality Decoupling Module (D2M) and the Channel Attention Module (CAM), into the ConvLSTM framework. The D2M module compresses two-dimensional spatial features and builds the motion model in the decomposed low-dimensional space, reducing the difficulty of modeling motion features and strengthening the propagation of motion patterns along the temporal dimension. The CAM module captures channel-level representations of global spatial features and emphasizes informative channels through attention, which improves the overall spatial structure of the predictions and enhances their spatial feature representation. Compared with existing methods, the proposed method improves both qualitative and quantitative results, validating its effectiveness.

3. A lightweight video frame prediction algorithm with multi-granularity features and asymmetric motion patterns. To further improve prediction accuracy, models have been evolving toward larger and deeper structures. However, this trend demands substantial storage and computational resources, making such models difficult to deploy and run in typical environments. Lightweight algorithms are therefore introduced to balance accuracy against parameter size. Existing lightweight techniques mainly target image-level models; lightweight methods for video network models remain underexplored. Based on this premise, we propose a lightweight video frame prediction algorithm consisting of three components: the Asymmetric Convolutional Kernel (ACK), the Fine-grained Feature Extractor (FFE), and the Coarse-grained Feature Fuser (CFF). ACK is a lightweight asymmetric convolutional kernel that decomposes motion directions and builds motion models along each direction. FFE enhances the model's spatial feature extraction capability by introducing non-linearity into the network. CFF integrates features at different levels through skip connections, improving the model's feature utilization and shortening the error backpropagation path. Experiments show that, compared with the ConvLSTM baseline, the proposed method improves accuracy while significantly reducing the number of parameters; compared with existing methods, it achieves comparable accuracy with a much smaller parameter count.

In conclusion, this dissertation proposes video frame prediction algorithms motivated by observable sequence length, the characteristics of the prediction task, and model lightweighting, and validates their effectiveness through experimental evaluations and comparisons on multiple datasets. Furthermore, considering the proliferation of hardware acquisition devices and trends in the development of artificial intelligence systems, we analyze and discuss the future research and practical applications of video frame prediction.
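The mask-based separation in contribution 1 can be illustrated with a minimal sketch. The dissertation does not give the exact formulation; the function name, shapes, and the soft per-pixel blend below are assumptions, shown only to convey how a mask can preserve the static background while the network predicts the dynamic foreground.

```python
import numpy as np

def compose_frame(mask, background, foreground):
    """Blend static background and predicted foreground via a soft mask.

    mask == 1 keeps the static background; mask == 0 takes the
    predicted dynamic content. (Illustrative assumption, not the
    dissertation's actual code.)
    """
    return mask * background + (1.0 - mask) * foreground

H, W = 4, 4
mask = np.ones((H, W))            # start fully static
mask[1:3, 1:3] = 0.0              # a moving 2x2 region in the centre
background = np.full((H, W), 0.5) # last observed static appearance
foreground = np.full((H, W), 0.9) # predicted dynamic content

frame = compose_frame(mask, background, foreground)
```

In this sketch the border pixels keep the background value 0.5 while the masked centre takes the predicted value 0.9.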
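PSNR, the metric used for the UCF-101 and KITTI comparison, has a standard definition that can be stated concretely. The sketch below assumes frames normalized to [0, 1]; it is the usual formula, not code from the dissertation.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two frames."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.zeros((8, 8))
pred = np.full((8, 8), 0.1)   # uniform error of 0.1 -> MSE = 0.01
score = psnr(pred, target)    # 10 * log10(1 / 0.01) = 20 dB
```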
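The CAM module of contribution 2 follows the general pattern of channel attention: pool the spatial dimensions into a per-channel descriptor, derive per-channel weights, and reweight the feature map. The sketch below is a generic squeeze-and-excitation-style stand-in, with a softmax in place of CAM's learned gating network; the module's actual internals are not specified in this abstract.

```python
import numpy as np

def channel_attention(x):
    """x: (C, H, W) feature map -> channel-reweighted feature map.

    Generic channel-attention sketch; the softmax stands in for the
    learned excitation network (an assumption, not the real CAM).
    """
    descriptor = x.mean(axis=(1, 2))                      # squeeze: (C,)
    weights = np.exp(descriptor) / np.exp(descriptor).sum()  # gate: (C,)
    return x * weights[:, None, None]                     # reweight

# Two channels: the second carries a stronger global response,
# so it receives the larger attention weight.
x = np.stack([np.full((2, 2), 1.0), np.full((2, 2), 3.0)])
y = channel_attention(x)
```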
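The parameter saving behind asymmetric kernels such as ACK can be made concrete: replacing one k x k convolution with a k x 1 plus 1 x k pair shrinks the kernel parameters from k*k to 2k per channel pair. The channel counts below are illustrative assumptions, not figures from the dissertation.

```python
def conv_params(kh, kw, c_in, c_out):
    """Weight count of a single convolution layer, ignoring biases."""
    return kh * kw * c_in * c_out

k, c = 7, 64  # assumed kernel size and channel width
square = conv_params(k, k, c, c)                                # 7*7*64*64
asymmetric = conv_params(k, 1, c, c) + conv_params(1, k, c, c)  # 2*7*64*64
savings = 1 - asymmetric / square                               # 1 - 2/7
```

For a 7 x 7 kernel this removes five sevenths of the weights, which is the kind of reduction that lets the full model undercut the ConvLSTM baseline's parameter count.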