
Research On Video Captioning Method Based On Encoder-Decoder

Posted on: 2023-03-04    Degree: Master    Type: Thesis
Country: China    Candidate: M M Wang    Full Text: PDF
GTID: 2568307127983479    Subject: Software engineering
Abstract/Summary:
In recent years, with the rapid development of the Internet and multimedia technologies, video data have increased dramatically. To help users understand and select the videos they need, video captioning has gradually attracted academic attention. It aims at high-level semantic understanding and natural-language description of visual content, and has broad application prospects in video retrieval, assisted vision, surveillance description, and other areas.

Existing video captioning works mainly use encoder-decoder architectures to encode and decode visual information and thereby describe video content in text. However, such methods extract visual information only at the feature level and give little consideration to the semantic analysis of visual features within the generated sentences. In addition, the generated sentences depend too heavily on the annotation information attached to the video data, so encoder-decoder models struggle to produce semantically rich captions. To address these issues, this thesis improves the traditional encoder-decoder approach by combining video feature extraction, scene representation construction, and the introduction of external corpora. The main contributions are as follows.

(1) To address the unclear syntactic structure of description sentences caused by insufficient semantic analysis in encoder-decoder-based video captioning, a novel video captioning method based on syntactic analysis of object features in a scene representation is proposed. First, in the encoding stage, the 2D and C3D features of videos, the object features extracted by a Faster R-CNN model, and the self-attention mechanism of the Transformer are combined to build a visual scene representation model that captures the dependencies among visual features. Then, a syntactic analysis model for visual object features is constructed to analyze the syntactic role that each object feature in the visual scene representation plays in the description sentence. Finally, in the decoding stage, the combined syntactic analysis results are fed into an LSTM network to output the caption. Experimental results show that the proposed method generates video captions with a clear grammatical structure.

(2) Current encoder-decoder-based video captioning methods rely mostly on a single video input source and give little consideration to using external corpus information to guide caption generation, so the generated captions carry limited semantic information, which hinders accurate description of video content. To address this issue, a guided video captioning method based on a sentence retrieval and generation network (ED-SRG) is proposed. First, an encoder-decoder model extracts the 2D, 3D, and object features of videos and decodes them into simple description sentences. Then, a sentence-transformer network retrieves, by measuring inter-sentence similarity, sentences in an external corpus that are semantically similar to these descriptions. Finally, a novel RS GPT-2 network model is constructed, which introduces a designed random selector that randomly selects predicted words with a high probability of occurrence in the corpus, guiding the generation of descriptions that conform to natural human language expression. The proposed method is evaluated on the public MSVD and MSR-VTT datasets. Results show that it improves the BLEU-4, CIDEr, ROUGE_L, and METEOR evaluation metrics by 19.4%, 13.1%, 11.6%, and 13.5%, respectively.
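The scene-representation encoding in contribution (1), which projects 2D, C3D, and Faster R-CNN object features into a common space and models their dependencies with Transformer-style self-attention, can be sketched as follows. This is a minimal NumPy illustration under assumed feature dimensions; the random projection matrices stand in for learned weights, and none of the names reflect the thesis's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over visual tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise token dependencies
    return softmax(scores) @ v                # attention-weighted fusion

rng = np.random.default_rng(0)
d_model = 64
# Hypothetical feature dims: 2D appearance, C3D motion, Faster R-CNN objects
feats = {"2d": rng.normal(size=(8, 2048)),    # 8 sampled frames
         "c3d": rng.normal(size=(4, 4096)),   # 4 motion clips
         "obj": rng.normal(size=(10, 1024))}  # 10 detected objects
# Project each stream into the shared d_model space (random stand-ins for learned weights)
proj = {k: rng.normal(size=(v.shape[1], d_model)) * 0.01 for k, v in feats.items()}
tokens = np.concatenate([feats[k] @ proj[k] for k in feats], axis=0)  # (22, 64)

w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
scene_repr = self_attention(tokens, w_q, w_k, w_v)
print(scene_repr.shape)  # (22, 64): one dependency-aware vector per visual token
```

The key design point illustrated here is that frames, clips, and objects all become tokens in one sequence, so attention can relate an object to the frames and motion segments it appears in.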
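The retrieval step in contribution (2), finding corpus sentences semantically similar to the draft descriptions, reduces to nearest-neighbor search over sentence embeddings. A hedged sketch follows, assuming the embeddings were already produced by a sentence-transformer-style encoder; the vectors below are random stand-ins for real embeddings.

```python
import numpy as np

def cosine_retrieve(query_vec, corpus_vecs, top_n=2):
    """Return indices of the top_n corpus sentences most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q                          # cosine similarity to every corpus sentence
    return np.argsort(sims)[::-1][:top_n]

rng = np.random.default_rng(1)
corpus = rng.normal(size=(100, 384))               # 100 corpus sentences, 384-d embeddings
query = corpus[42] + 0.05 * rng.normal(size=384)   # near-duplicate of sentence 42
print(cosine_retrieve(query, corpus, top_n=1)[0])  # 42 — the near-duplicate is retrieved first
```

In practice the retrieved sentences would then be handed to the generation network as semantic guidance.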
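The "random selector" in the RS GPT-2 model, which randomly selects among only the highest-probability predicted words rather than always taking the single most likely one, behaves like top-k sampling. The sketch below is illustrative only; the vocabulary, k value, and decoder scores are assumptions, not the thesis's exact design.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def random_select(logits, k=3, rng=None):
    """Sample the next word from the k highest-probability candidates."""
    rng = rng or np.random.default_rng()
    top = np.argsort(logits)[::-1][:k]   # indices of the k most likely words
    probs = softmax(logits[top])         # renormalize over the top-k only
    return top[rng.choice(k, p=probs)]

vocab = ["a", "man", "is", "playing", "guitar", "piano"]
logits = np.array([0.1, 0.2, 0.3, 1.5, 2.0, 1.8])  # hypothetical decoder scores
rng = np.random.default_rng(0)
picks = [vocab[random_select(logits, k=3, rng=rng)] for _ in range(5)]
print(picks)  # every draw is one of the three high-probability words
```

Compared with greedy decoding, this kind of controlled randomness yields more varied phrasing while still keeping the output close to natural language statistics, which matches the stated goal of the random selector.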
Keywords/Search Tags:Video captioning, Visual scene representation, Grammatical analysis, Statement retrieval, RS GPT-2 model