Research On Video Captioning Based On Semantic Enhancement

Posted on:2024-01-23

Degree:Master

Type:Thesis

Country:China

Candidate:S Li

GTID:2568306941464014

Subject:Computer technology

Abstract/Summary:

Video captioning aims to generate a natural-language description of the content of a given video. It has broad application prospects and has received widespread attention in recent years. This thesis focuses on improving the semantic accuracy and richness of model-generated descriptions. It approaches the problem from three perspectives: visual semantic enhancement, joint enhancement of visual and language semantics, and external semantic knowledge enhancement. Each approach is verified experimentally on the common datasets MSVD and MSR-VTT. The main research work is as follows:

(1) A video captioning method based on visual semantic enhancement is proposed. It addresses the problem that output descriptions miss detailed visual content because visual semantic information is insufficiently mined. By modeling the spatiotemporal relationship between object-level region features and frame-level global features, the method assigns different weights to the region features and fuses them into visual features that carry detailed information. To further strengthen the semantic expressiveness of the visual features, contextual visual features are dynamically aggregated at the decoding end and combined with the global visual features, and the result is then fused with the detected semantic attribute features. Ablation experiments, comparative experiments, and visualization results on the datasets demonstrate the effectiveness of the method, and the output of the model after visual semantic enhancement reflects the detailed visual content of the video.

(2) A video captioning method based on visual-language association enhancement is proposed. It addresses the problem that output descriptions cover the video content incompletely because the association between vision and language is insufficient. The method uses visual features to predict the corresponding language features and uses the decoding output to reconstruct the visual features, which strengthens the correlation between visual semantics and language semantics both before and after decoding. In addition, by strengthening the semantic similarity between the generated caption and the reference caption, the method brings the generated caption semantically closer to the human description. Ablation experiments, comparative experiments, and visualization results on the datasets demonstrate that, after the joint enhancement of visual and language semantics, the model output expresses the complex content of the video more completely.

(3) A video captioning method based on the introduction of external knowledge is proposed. It addresses the problem that output descriptions are not informative enough because the internal knowledge of the model is limited. In addition to the limited internal semantic knowledge provided by the video input, the method introduces commonsense information from Wikipedia that is related to the semantics of the video as supplementary external semantic knowledge, and the two types of knowledge are fused during decoding in the decoder. Ablation experiments, comparative experiments, and visualization results on the datasets demonstrate the effectiveness of the method. The leading result on CIDEr, a metric related to human perception, indicates that the model generates more accurate and richer sentences after external knowledge is introduced.

In conclusion, this thesis explores the impact of visual semantic enhancement, joint enhancement of visual and language semantics, and external semantic knowledge enhancement on the video captioning model. Extensive experiments on the common datasets verify that the proposed methods enhance the semantic accuracy and richness of the output descriptions.
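
The weighting-and-fusion step described in method (1) can be illustrated with a small sketch. The abstract does not say how the weights are computed, so this example assumes a scaled dot-product score between the frame-level global feature and each region feature, followed by a softmax. The function names `softmax` and `fuse_region_features` and the toy vectors are hypothetical and are not taken from the thesis.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_region_features(global_feat, region_feats):
    """Weight region features by similarity to the global feature and fuse them.

    Assumption: scaled dot-product scoring stands in for the thesis's
    spatiotemporal relation model, which the abstract does not specify.
    Returns (weights, fused_feature).
    """
    dim = len(global_feat)
    # One score per region: similarity to the frame-level global feature.
    scores = [
        sum(g * r for g, r in zip(global_feat, region)) / math.sqrt(dim)
        for region in region_feats
    ]
    weights = softmax(scores)
    # Weighted sum of the region features, computed dimension by dimension.
    fused = [
        sum(w * region[i] for w, region in zip(weights, region_feats))
        for i in range(dim)
    ]
    return weights, fused


# Toy data: one global feature and three object-level region features.
global_feat = [1.0, 0.0, 0.0, 0.0]
region_feats = [
    [2.0, 0.0, 0.0, 0.0],  # aligned with the global feature
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
weights, fused = fuse_region_features(global_feat, region_feats)
print("weights:", [round(w, 3) for w in weights])
print("fused:  ", [round(v, 3) for v in fused])
```

In this toy setup the region aligned with the global feature receives the largest weight, so it dominates the fused feature. A real model would learn the scoring function and operate on detector and CNN features rather than hand-written vectors.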

Keywords/Search Tags:

Video Captioning, Semantic Information, Vision-language Alignment, External Knowledge

Related items

1	Research On Deep Image Captioning Technology With Semantic Guidance
2	Research On Video Description Algorithm Based On Visual Semantic Understanding
3	Research On Cross-modal Semantic Alignment For Vision And Language
4	Research On Video Captioning Based On Semantic Information
5	Image Captioning Theories And Methods
6	A Study On Neural Network-based Natural Language Semantic Representation
7	Research On Semantic-Aware Based Video Captioning
8	Research On Visual Captioning Algorithm For “Visual-Linguistic” Cross-Modal Semantic Alignment
9	Improving Pre-Trained Language Representations With External Knowledge For Spoken Language Understanding
10	Research On The Theory And Method Of Visual Captioning