
Video Prediction Based-on Spatial-temporal Fusion And Non-local Block

Posted on: 2021-03-07    Degree: Master    Type: Thesis
Country: China    Candidate: L Feng    Full Text: PDF
GTID: 2428330626456031    Subject: Signal and Information Processing
Abstract/Summary:
With the development of deep learning techniques, much progress has been made in various applications of computer vision, including image classification, video super-resolution, and video classification. This thesis focuses on video prediction, a very challenging computer-vision task. The goal of video prediction is to produce n new frames given m consecutive previous frames of a video sequence; in this thesis, m is set to 2 and n is set to 1. Compared with images, videos contain not only spatial dependencies but also temporal dependencies. We therefore propose a video prediction algorithm based on spatial-temporal fusion and the non-local block to exploit both the spatial and temporal dependencies in the video. The main contributions are as follows.

In Chapters 1 and 2, we introduce the research background and motivation and review state-of-the-art related work. We focus on modeling the spatial-temporal dependencies of video and study common spatial-temporal fusion methods, including methods based on temporal, spatial, and progressive fusion. We then introduce the non-local block, which captures long-range dependencies in a comprehensive manner.

In Chapter 3, we propose a video prediction framework based on spatial-temporal fusion and the non-local block. Specifically, we adopt an encoder-decoder structure consisting of an encoder, a bottleneck layer, a kernel-generating decoder, and a mask-generating decoder. The encoder encodes every frame in the video sequence and extracts its features. The bottleneck layer applies different spatial-temporal fusion methods to capture the temporal and spatial dependencies of the video feature sequence. The kernel-generating and mask-generating decoders take the bottleneck features as input and generate kernels and masks, respectively. The core idea of this framework is that the dynamically generated kernels are convolved with the last input frame to produce transformed images at each time step, and the frame generated at the previous time step and these transformed images are then combined into one frame by masking. On this basis, we design a video prediction network based on direct fusion. The experimental results demonstrate the effectiveness of the proposed framework. We also analyze the effect of direct fusion, and of its combination with the non-local block, on video prediction performance. Further, we apply a generative adversarial training strategy to the networks to improve prediction performance by reducing blurring artifacts in the predicted frames.

In Chapter 4, we select ConvLSTM as the spatial-temporal fusion method to overcome the problem that direct fusion ignores the influence of each individual frame on the predicted frame. The experimental results demonstrate that the spatial-temporal dependency modeling of ConvLSTM is stronger than that of direct fusion. In addition, we use the non-local block to optimize the structure of the ConvLSTM-based video prediction network and discuss the influence of its long-range dependency modeling on prediction performance.

In Chapter 5, we select progressive fusion as the spatial-temporal fusion method to address two problems: direct fusion fails to capture the relationship between each individual frame and the predicted frame, and the non-local block cannot fully play its long-range dependency-modeling role in ConvLSTM-based video prediction. The experimental results demonstrate the effectiveness of the video prediction algorithm based on progressive fusion and the non-local block. In addition, the proposed algorithm is compared with typical direct-generation and flow-based algorithms.
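The kernel-and-mask mechanism at the core of the framework can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions of our own (single-channel frames and a small set of K global kernels standing in for the network-generated ones); it is not the thesis implementation, only the combination rule: convolve the last frame with each kernel to get candidate images, then blend them with softmax-normalized per-pixel masks.

```python
import numpy as np

def transform_frame(frame, kernels):
    """Convolve a single-channel frame (H, W) with K kernels (K, k, k),
    producing K transformed candidate images (K, H, W), 'same' padding."""
    K, kh, kw = kernels.shape
    H, W = frame.shape
    pad = kh // 2
    padded = np.pad(frame, pad, mode="edge")
    out = np.zeros((K, H, W))
    for n in range(K):
        for i in range(H):
            for j in range(W):
                out[n, i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernels[n])
    return out

def blend_with_masks(candidates, masks):
    """Combine K candidate images (K, H, W) into one frame using masks
    (K, H, W), softmax-normalized over K so per-pixel weights sum to 1."""
    e = np.exp(masks - masks.max(axis=0, keepdims=True))
    w = e / e.sum(axis=0, keepdims=True)
    return (w * candidates).sum(axis=0)
```

In the full framework the kernels and masks would come from the two decoders at each time step; here they are passed in directly to isolate the combination step.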
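The non-local block used for long-range dependency modeling can likewise be sketched as a self-attention operation over spatial positions, in the embedded-Gaussian style of Wang et al.'s non-local networks. The plain projection matrices below are illustrative assumptions standing in for the learned 1x1 convolutions of a real implementation:

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """Minimal non-local (embedded-Gaussian) block on a feature map.
    x: (C, H, W); w_theta, w_phi, w_g: (C2, C); w_out: (C, C2).
    Each position attends to every other position, so dependencies are
    captured regardless of spatial distance; a residual adds x back."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                    # (C, N) with N = H*W
    theta = w_theta @ flat                        # query embedding (C2, N)
    phi = w_phi @ flat                            # key embedding   (C2, N)
    g = w_g @ flat                                # value embedding (C2, N)
    logits = theta.T @ phi                        # (N, N) pairwise affinities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over positions j
    y = g @ attn.T                                # y_i = sum_j attn[i, j] g_j
    return x + (w_out @ y).reshape(C, H, W)       # residual connection
```

Because of the residual connection, the block can be dropped into an existing network (here, the bottleneck of the prediction network) without changing its input/output shape.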
Keywords/Search Tags:video prediction, spatial-temporal fusion, non-local block