Deep-learning-based video inpainting extracts useful information from multiple video frames to fill the missing (i.e., masked) regions of the frames to be repaired. An effective video inpainting algorithm is valuable in video editing tasks such as damaged-video restoration, unwanted-object removal, and video retargeting. Most existing video inpainting models are built on convolutional neural networks, optical-flow estimation, or attention-based network structures. Although each of these approaches has its own advantages, they also have shortcomings.

First, convolutional networks are good at extracting local features but are limited in capturing global feature representations and cannot obtain deep-level features. This restricts the model's ability to learn and generalize high-frequency features and can break temporal consistency across frames, producing flickering and incoherent visual results. Second, optical-flow-based models propagate context information through estimated flow, but obtaining accurate flow inside the masked region is difficult: errors introduced at intermediate steps propagate to all subsequent repair results, accumulating and amplifying at each stage. Such models therefore struggle with video frames that contain large motion and large mask blocks, and their repaired results easily exhibit artifacts and content distortion. Third, although attention models are better at capturing long-range features, they can degrade local detail and fail to recover fine frame content, yielding blurred and inconsistent frames; moreover, simply stacking attention modules makes computation very expensive.

To address these problems, this work makes two main contributions:

(1) A parallel network model combining convolution and attention modules is designed to address the flickering that arises because 2D convolutional networks extract local frame information well but struggle to capture global features, leading to temporally inconsistent repairs. The model combines a convolutional network and an attention mechanism into a parallel, interactive structure: fusing the local features extracted by the convolutional branch with the global features captured by the attention branch makes the repaired results clearer, and combining the two networks improves performance without increasing model complexity.

(2) To handle large motion in video frames and mask blocks that are difficult to repair, a network model based on a UNet-variant structure is designed. Because adjacent frames and distant frames provide different feature information, we select some nearby and some distant frames to form different reference-frame sequences, extract feature information at different stages through convolutional networks and attention mechanisms, and establish more accurate correspondences among the extracted multi-frame features for fusion and reconstruction, so that the model obtains more inter-frame information. Experimental results show that this method achieves good repair quality on video inpainting tasks with multiple masks.
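The parallel convolution-plus-attention idea in contribution (1) can be sketched as a single PyTorch block. This is a minimal illustration under stated assumptions, not the paper's actual architecture: the branch layouts, the additive fusion, and all names here are hypothetical.

```python
import torch
import torch.nn as nn


class ParallelConvAttention(nn.Module):
    """Sketch of a parallel conv + attention block (illustrative only).

    A convolutional branch extracts local features while a self-attention
    branch captures global context over all spatial positions; the two
    are fused by simple addition with a residual path.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: ordinary 2D convolutions (good at local detail).
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Global branch: self-attention over flattened spatial tokens.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local_feat = self.local(x)
        # Flatten H*W positions into a token sequence for attention.
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        global_feat, _ = self.attn(tokens, tokens, tokens)  # (B, H*W, C)
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        # Fuse local and global features; residual keeps the input path.
        return x + local_feat + global_feat
```

Because the two branches run on the same input in parallel (rather than stacked in series), the global branch does not have to be repeated at every layer, which is one way such a design can avoid the cost of stacking attention modules.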
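The reference-frame selection in contribution (2) can be illustrated with a small helper. The neighborhood size and sampling stride below are illustrative assumptions; the paper's actual values and sampling rule are not specified here.

```python
def select_reference_frames(target_idx: int, num_frames: int,
                            num_local: int = 2, distant_stride: int = 10):
    """Sketch of building reference-frame sequences from nearby and
    distant frames (hypothetical parameters, for illustration only).

    Returns two lists of frame indices: immediate neighbors of the
    target frame, and uniformly strided distant frames.
    """
    # Nearby frames: immediate neighbors on both sides of the target,
    # clipped to the valid frame range and excluding the target itself.
    local = [i for i in range(target_idx - num_local, target_idx + num_local + 1)
             if 0 <= i < num_frames and i != target_idx]
    # Distant frames: uniformly strided samples across the whole video,
    # skipping anything already chosen as a nearby frame.
    distant = [i for i in range(0, num_frames, distant_stride)
               if i != target_idx and i not in local]
    return local, distant


# For a 60-frame video with target frame 20:
local, distant = select_reference_frames(20, 60)
# local  -> [18, 19, 21, 22]
# distant -> [0, 10, 30, 40, 50]
```

Nearby frames supply temporally aligned local detail, while distant frames can reveal background content that stays occluded in the immediate neighborhood, which is why the two sequences carry different feature information.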