With the continuous progress of science and technology and the development of deep learning, video description has become a research topic that keeps pace with the times and attracts many researchers at home and abroad; after years of work, a series of results have been achieved. Video description is a cross-modal task involving images, text, and audio: given a video segment, the goal is to generate a description based on the video's content, and the datasets used consist of short videos paired with annotation text. Dense video description is a more complex task than video description, requiring the analysis and processing of a longer video that contains multiple events, each of which must be extracted and described separately. The main work of this article is to study these two tasks: video description and dense video description.

First, a video description method that fuses visual and speech features is proposed. To address the lack of audio guidance in existing video description tasks, Mel-frequency cepstral features are extracted from the audio in the video and added to network training, so that feature information from different modalities guides the model to generate more effective text. In addition, in the visual feature extraction stage, this article uses a relatively advanced model framework, the Vision Transformer, to obtain more accurate visual feature information and provide more effective input for subsequent text generation. Extensive comparative experiments on the MSVD and MSR-VTT datasets show that the ROUGE-L metric improves by 0.2 and 1.2, respectively.

Second, a dense event description method based on visual-semantic embedding is proposed. In existing research, each event is localized only according to visual cues, while the related semantic and contextual information is ignored. The proposed method consists of a visual-semantic embedding module, a hierarchical description transformation module, and an event proposal network module. It captures the contextual associations of words, obtains more effective word information through n-grams, and achieves a joint embedding of visual and textual information. Temporal information is then embedded through the hierarchical description transformation, video events are segmented and extracted by the proposal network module, and the result is fed into a multi-head self-attention network for text generation. Experiments on the ActivityNet dataset show that ROUGE-L improves by 1.9, exceeding the current average level.
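As a concrete illustration of the fused visual and speech features described above, the following is a minimal sketch, assuming the MFCC audio features are extracted with librosa and the frame-level visual features with a pretrained Vision Transformer from timm; the function names, feature dimensions, and mean-pooling fusion are illustrative assumptions, not the exact pipeline used in this work.

```python
# Minimal sketch: ViT frame features + MFCC audio features for a captioning encoder.
# Assumed libraries: torch, timm, librosa. Names and dimensions are illustrative.
import torch
import timm
import librosa


def extract_visual_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, 224, 224), ImageNet-normalized. Returns (num_frames, 768)."""
    vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
    vit.eval()
    with torch.no_grad():
        return vit(frames)  # pooled per-frame features from the Vision Transformer


def extract_audio_features(wav_path: str, n_mfcc: int = 40) -> torch.Tensor:
    """Returns (num_windows, n_mfcc) MFCC features for the video's audio track."""
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return torch.from_numpy(mfcc).T.float()


def fuse(visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
    """Mean-pool each modality over time and concatenate into one clip-level vector."""
    return torch.cat([visual.mean(dim=0), audio.mean(dim=0)], dim=-1)  # (768 + n_mfcc,)
```

In a full captioning model, the fused clip vector (or the per-frame and per-window sequences before pooling) would be fed to the text decoder; the mean-pooling here is only the simplest possible fusion strategy.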