With the rapid development of computer technology, generating multi-sentence descriptions for videos has become one of the most challenging computer vision tasks, as it requires not only visual relevance but also discourse coherence between the sentences of a paragraph. Dense video description aims to generate descriptions for all possible events in an untrimmed video. The task is challenging because it requires accurately localizing the events in the video and simultaneously describing each event in natural-language sentences. Most current models adopt a two-stage approach, which lowers efficiency, and because the two stages are independent of each other, the model cannot fully exploit contextual information. For feature extraction, most previous work on dense video description relies on visual information alone, ignoring audio and semantic information, which leads to incomplete descriptions. To address these problems, this paper proposes a multimodal-fusion dense video description method based on the Transformer network. First, an end-to-end dense video description model is designed: it applies parallel prediction heads and adapts the object detection algorithm Detection Transformer (DETR) to dense video description, realizing end-to-end description of the video and generating mutually correlated event descriptions. For the encoder-decoder framework, a multi-scale deformable temporal attention module is proposed, which strengthens the model's use of local information and speeds up convergence while improving accuracy. On top of the end-to-end model, multimodal feature fusion is added: an adaptive R(2+1)D network is proposed to extract visual features, and audio features are added as a complement to generate richer and more accurate descriptions. A semantic detector is designed to generate semantic information, alleviating the semantic inconsistency of video description results. Comparative and ablation experiments were conducted on the large benchmark datasets ActivityNet Captions and YouCook2, and better results were obtained on all four evaluation metrics. On the ActivityNet Captions dataset, the BLEU_4 score reached 2.17, exceeding the best previously reported result, while METEOR, CIDEr, and SODA reached 9.23, 42.14, and 6.05, respectively. On the YouCook2 dataset, the model performed well on all four metrics, with BLEU_4 reaching 0.92. The experimental results demonstrate that the model performs well on both benchmark datasets: the end-to-end approach with multimodal feature inputs leads to a significant improvement in performance.
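To make the multi-scale deformable temporal attention module more concrete, the following is a minimal sketch of how such a layer could be implemented, assuming a Deformable-DETR-style design transferred to the 1D temporal axis of video features. The class name, tensor shapes, and hyperparameters are illustrative assumptions, not the authors' implementation: each event query predicts a small set of sampling offsets and weights per temporal scale, so attention stays local and sparse instead of attending to every frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableTemporalAttention(nn.Module):
    """Hypothetical single-head multi-scale deformable attention over time."""

    def __init__(self, d_model=256, n_levels=4, n_points=4):
        super().__init__()
        self.n_levels, self.n_points = n_levels, n_points
        # Each query predicts where to sample on every temporal scale
        # and how much to weight each sampled feature.
        self.sampling_offsets = nn.Linear(d_model, n_levels * n_points)
        self.attention_weights = nn.Linear(d_model, n_levels * n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, query, ref_t, feats):
        """
        query: (B, Q, C)  event queries
        ref_t: (B, Q)     reference times, normalized to [0, 1]
        feats: list of n_levels tensors, each (B, T_l, C)
        """
        B, Q, _ = query.shape
        offsets = self.sampling_offsets(query).view(B, Q, self.n_levels, self.n_points)
        weights = self.attention_weights(query).view(B, Q, -1)
        weights = weights.softmax(-1).view(B, Q, self.n_levels, self.n_points)

        sampled = []
        for lvl, f in enumerate(feats):
            T_l = f.shape[1]
            v = self.value_proj(f).transpose(1, 2).unsqueeze(2)   # (B, C, 1, T_l)
            # Sampling location = reference time + learned, scale-relative offset.
            loc = ref_t.unsqueeze(-1) + offsets[:, :, lvl] / T_l  # (B, Q, P)
            grid_x = loc * 2 - 1                                  # grid_sample expects [-1, 1]
            grid = torch.stack([grid_x, torch.zeros_like(grid_x)], dim=-1)  # (B, Q, P, 2)
            sampled.append(F.grid_sample(v, grid, align_corners=False))     # (B, C, Q, P)

        samp = torch.stack(sampled, dim=3)                        # (B, C, Q, L, P)
        out = (samp * weights.unsqueeze(1)).sum(dim=(3, 4))       # (B, C, Q)
        return self.output_proj(out.transpose(1, 2))              # (B, Q, C)


# Usage with made-up shapes: 2 videos, 10 event queries, 4 temporal scales.
att = DeformableTemporalAttention()
queries = torch.randn(2, 10, 256)
ref = torch.rand(2, 10)
feats = [torch.randn(2, t, 256) for t in (128, 64, 32, 16)]
print(att(queries, ref, feats).shape)  # torch.Size([2, 10, 256])
```

Because each query gathers only n_levels × n_points features around its reference time, this kind of layer focuses on local temporal context and typically converges faster than dense attention, which is consistent with the benefits claimed for the proposed module.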