
Research On Key Technologies Of Video Summarization Based On Deep Information Extraction

Posted on: 2024-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: H Sun
Full Text: PDF
GTID: 2568307136992579
Subject: Electronic information
Abstract/Summary:
Video summarization is an intelligent video processing technique that automatically analyzes and processes video content using machine learning and computer vision. It aims to generate images or short videos that capture the key information in the source video, helping users quickly browse and comprehend video content while reducing the time and effort required to find specific videos. Video summarization research covers many types of video, including everyday videos, professional sports videos, and surveillance videos. Given the significant advances that deep learning networks have achieved across research domains, using deep networks to extract the deep information of videos for summary generation has become a crucial research question in intelligent video processing. This thesis takes the spatial features of video frames as input and processes them to obtain deep information, investigating the generation of static video summaries for sports videos and dynamic video summaries for everyday videos. The main innovations of this thesis are as follows:

(1) A motion image generation algorithm based on depth information extraction is proposed for sports videos. The algorithm takes the deep spatial features extracted from video frames by the GoogLeNet network and consists of three processing modules. First, a Bi-LSTM (Bi-directional Long Short-Term Memory) network models the temporal information between video frames, combined with a two-stage attention mechanism that applies different attention strategies to different network models: the contribution of each region of the input frames is computed, and region weights are adjusted accordingly, so that the network focuses on the regions containing moving objects. Second, a multilayer perceptron predicts frame-level importance scores and inter-frame similarity from the extracted depth information; to enhance diversity and reduce redundancy in the selected key frame sequence, seqDPP (Sequential Determinantal Point Process) selects key motion frames based on these importance scores and similarities. Finally, the Canny edge detection algorithm detects the edges of moving objects in the selected key action frames; contour information is extracted from the detected edges, and convex hulls enclosing the moving objects are constructed to obtain their spatial information. Collision detection is then performed on the moving objects across different key frames, suitable motion trajectories are selected, and motion image summaries are generated.

(2) For everyday videos, a video summarization approach based on a differential-regularization multi-head self-attention mechanism is proposed. Again taking the deep spatial features extracted by the GoogLeNet network as input, a lightweight self-attention module processes these features in place of cumbersome recurrent neural networks. The module consists of three stacked attention layers, each containing two sub-modules. The first sub-module is a multi-head self-attention mechanism enhanced with differential regularization: it attends to different positions and subspaces simultaneously, and the regularizer enlarges the distance between the attention heads to ensure their diversity. The second sub-module is a fully connected feed-forward network that applies nonlinear transformations to the attention output, enhancing the model's representational capacity and providing richer, higher-level feature representations for sequence modeling. Next, a linear layer predicts frame-level importance scores and inter-frame similarity from the deep information produced by the self-attention module. Finally, seqDPP selects key shots and generates dynamic video summaries based on shot importance scores and shot similarity, increasing the diversity of key shots in the summary and improving summarization accuracy.
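Both methods use a determinantal point process to trade off frame importance against inter-frame similarity. The following is a minimal sketch of that selection idea, not the seqDPP implementation used in the thesis: it builds a quality-weighted kernel from hypothetical importance scores and a similarity matrix, then greedily picks the subset whose kernel submatrix has the largest determinant, which naturally penalizes near-duplicate frames.

```python
def det(m):
    """Determinant via Gaussian elimination with partial pivoting (small matrices)."""
    n = len(m)
    a = [row[:] for row in m]
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[p][i]) < 1e-12:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

def greedy_dpp(scores, sim, k):
    """Greedily select k frames maximizing the determinant of the DPP kernel
    submatrix, where L[i][j] = scores[i] * sim[i][j] * scores[j]."""
    n = len(scores)
    L = [[scores[i] * sim[i][j] * scores[j] for j in range(n)] for i in range(n)]
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            g = det([[L[a][b] for b in idx] for a in idx])
            if g > best_gain:
                best, best_gain = i, g
        if best is None:
            break
        chosen.append(best)
    return sorted(chosen)

# Toy example: frames 0 and 1 are near-duplicates (similarity 0.98),
# so the second pick jumps to the dissimilar frame 2 despite frame 1's
# higher importance score.
scores = [0.9, 0.85, 0.8, 0.7]
sim = [[1.0, 0.98, 0.2, 0.1],
       [0.98, 1.0, 0.25, 0.1],
       [0.2, 0.25, 1.0, 0.3],
       [0.1, 0.1, 0.3, 1.0]]
print(greedy_dpp(scores, sim, 2))  # [0, 2]
```

The greedy step is the standard MAP approximation for DPPs; the actual seqDPP model additionally conditions each selection window on the previous one to preserve temporal order.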
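The geometric steps of the first method (convex hulls around detected contours, then collision detection between frames) can be sketched in simplified form. The hull construction below is Andrew's monotone chain over a hypothetical set of contour points, and the collision test is a coarse axis-aligned bounding-box overlap check standing in for whatever precise test the thesis uses:

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def bbox_overlap(hull_a, hull_b):
    """Coarse collision test: do the axis-aligned bounding boxes of two hulls intersect?"""
    ax0 = min(p[0] for p in hull_a); ax1 = max(p[0] for p in hull_a)
    ay0 = min(p[1] for p in hull_a); ay1 = max(p[1] for p in hull_a)
    bx0 = min(p[0] for p in hull_b); bx1 = max(p[0] for p in hull_b)
    by0 = min(p[1] for p in hull_b); by1 = max(p[1] for p in hull_b)
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

# Hypothetical contour points of a moving object in two key frames.
frame1 = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])  # interior point dropped
frame2 = convex_hull([(3, 3), (5, 3), (5, 5), (3, 5)])
print(bbox_overlap(frame1, frame2))  # False
```

When the boxes do not overlap, the object instances from different key frames can be composited onto one motion image without occluding each other; overlapping candidates would trigger selection of a different trajectory.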
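The abstract does not give the exact form of the differential regularizer in the second method. As one plausible illustration, the sketch below computes scaled dot-product attention weights for a single query, plus a penalty term that rewards large pairwise squared distance between the attention distributions of different heads, which is the general idea of pushing heads apart to keep them diverse:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over a list of keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

def head_divergence_penalty(head_weights):
    """Hypothetical differential-regularization term: negative mean pairwise
    squared distance between heads' attention distributions, so minimizing it
    as part of the loss pushes the heads apart."""
    h = len(head_weights)
    total, pairs = 0.0, 0
    for i in range(h):
        for j in range(i + 1, h):
            total += sum((a - b) ** 2 for a, b in zip(head_weights[i], head_weights[j]))
            pairs += 1
    return -total / pairs if pairs else 0.0

# Two heads attending to the same three frame features with different queries.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head1 = attention_weights([1.0, 0.0], keys)
head2 = attention_weights([0.0, 1.0], keys)
print(head_divergence_penalty([head1, head2]))  # negative: the heads differ
```

In the full model this penalty would be added to the training loss alongside the summary-quality objective; identical heads yield a penalty of zero, so the regularizer only bites when heads collapse onto the same positions.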
Keywords/Search Tags: Video summarization, Deep information, Two-stage attention mechanism, Bi-LSTM, Multi-head self-attention mechanism, Differential regularization