
Research On Video Summarization Based On Semantic Guidance

Posted on: 2024-02-24  Degree: Master  Type: Thesis
Country: China  Candidate: P Tang  Full Text: PDF
GTID: 2568307079976539  Subject: Electronic information
Abstract/Summary:
With the rapid development of the Internet and the widespread use of smart devices, the amount of video data has exploded, and how to manage and utilize this data has become an urgent problem. Video summarization has attracted widespread attention because it condenses the semantic content of a video by selecting the most representative parts of the original footage. The technology saves viewers' time, improving their experience and satisfaction, and also benefits downstream tasks such as video retrieval, event detection, and anomaly recognition. However, existing video summarization models have structural limitations when handling long video sequences, and most existing methods focus solely on visual cues while ignoring other modalities such as audio and text. Incorporating multiple modalities enriches a video's semantic expression and helps AI agents understand its content, so comprehensively exploiting multi-modal information is an effective way to improve summarization quality. To address these issues, this thesis proposes two distinct improved algorithms for video summarization.

1. To address the shortcomings of existing methods on long video sequences, this thesis proposes a model that combines the strengths of LSTM and Transformer networks (see the first sketch below). The model first uses a Transformer to encode the features within each video segment, which shortens the effective depth of the LSTM and mitigates the vanishing-gradient problem. The LSTM then integrates segment-level importance so that the representation of the current segment fuses the content of preceding segments. To verify the model's effectiveness, this thesis conducts experiments on a self-built dataset derived from Danmaku (bullet-comment) popularity and on the two commonly used datasets SumMe and TVSum; the results show the superiority of the model.

2. To address the problem that existing methods do not fully exploit the multi-modal information in videos, this thesis proposes a video summarization model based on multi-modal semantic fusion, which efficiently fuses the visual, audio, and text information of a video to generate a summary (see the second sketch below). First, features are extracted for the visual, text, and audio modalities. These features are then fused by a multi-modal Transformer network that adaptively adjusts the weight parameters between modalities. The fused features are next fed into a two-layer attention network that attends to both the global and local information in the video, improving summary quality. Finally, extensive experiments on two public datasets demonstrate the superiority of the proposed model on the video summarization task.
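A minimal PyTorch sketch of the first model follows, based only on the architecture described above (Transformer within segments, LSTM across segments). The feature dimension, segment length, pooling choice, and all module names are illustrative assumptions, not the thesis's actual configuration.

```python
# Sketch of the Transformer + LSTM hybrid for long videos.
# Dimensions and layer counts are assumptions for illustration.
import torch
import torch.nn as nn

class TransformerLSTMSummarizer(nn.Module):
    def __init__(self, feat_dim=1024, n_heads=8, hidden=512):
        super().__init__()
        # A Transformer encodes the frames inside each segment, so the
        # LSTM only runs over segment-level summaries; the shorter
        # recurrent sequence mitigates vanishing gradients.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.segment_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)  # per-segment importance score

    def forward(self, segments):
        # segments: (batch, n_segments, seg_len, feat_dim)
        b, s, t, d = segments.shape
        x = self.segment_encoder(segments.reshape(b * s, t, d))
        seg_repr = x.mean(dim=1).reshape(b, s, d)   # pool frames per segment
        h, _ = self.lstm(seg_repr)                  # carry context across segments
        return torch.sigmoid(self.score(h)).squeeze(-1)  # (batch, n_segments)
```

Pooling each Transformer-encoded segment into a single vector before the LSTM is what keeps the recurrent sequence short for long videos.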
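A similar sketch of the second, multi-modal model is given below, assuming pre-extracted and temporally aligned features per modality. The modality dimensions (1024-d visual, 128-d audio, 768-d text), the token-level fusion, and the windowed local attention are all assumptions standing in for details the abstract does not specify.

```python
# Sketch of the multi-modal fusion model with global/local attention.
# Modality dimensions, fusion strategy, and window size are assumptions.
import torch
import torch.nn as nn

class MultiModalFusionSummarizer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, local_window=16):
        super().__init__()
        # Project each modality into a shared space; a Transformer over the
        # interleaved tokens adaptively weights the modalities.
        self.proj = nn.ModuleDict({
            'visual': nn.Linear(1024, d_model),
            'audio':  nn.Linear(128, d_model),
            'text':   nn.Linear(768, d_model),
        })
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Two attention layers: global over the whole sequence, local
        # restricted to a sliding window around each position.
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local_window = local_window
        self.score = nn.Linear(d_model, 1)

    def forward(self, visual, audio, text):
        # Each input: (batch, seq_len, modality_dim), aligned per shot.
        tokens = torch.stack([self.proj['visual'](visual),
                              self.proj['audio'](audio),
                              self.proj['text'](text)], dim=2)
        b, t, m, d = tokens.shape
        fused = self.fusion(tokens.reshape(b, t * m, d))
        fused = fused.reshape(b, t, m, d).mean(dim=2)   # pool modalities
        g, _ = self.global_attn(fused, fused, fused)
        # Boolean mask: True blocks attention outside +/- local_window.
        idx = torch.arange(t, device=fused.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.local_window
        l, _ = self.local_attn(fused, fused, fused, attn_mask=mask)
        return torch.sigmoid(self.score(g + l)).squeeze(-1)  # (batch, seq_len)
```

Summing the global and local attention outputs is one simple way to realize the "two-layer attention" described above; the thesis may combine them differently.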
Keywords/Search Tags: Video Summarization, Multimodal Semantics, Feature Fusion, Long-Sequence Video