
Research On Video Captioning With Visual Content Understanding And Linguistic Information Analysis

Posted on: 2024-02-21    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Chen    Full Text: PDF
GTID: 2568307079459394    Subject: Computer Science and Technology
Abstract/Summary:
With the continuous improvement of network transmission speeds, human society has undergone tremendous changes: short videos occupy more of people's spare time, and the online entertainment streaming and e-commerce industries are thriving. Against this backdrop, cross-media content analysis and understanding has emerged as an area of great demand and challenge in the fields of artificial intelligence and deep learning.

As a cross-modal task that combines visual analysis with language generation, video caption generation starts by analyzing the video content and converting human-perceived vision into machine-readable feature representations. With the help of a trained language generator, accurate and detailed sentences describing the video content can then be produced. How to better understand the visual content, and how to better analyze the linguistic information, are therefore the fundamental problems of video captioning research.

Many studies have made targeted improvements to visual and textual understanding, but important problems remain. For example, in video understanding, how can existing datasets be fully exploited, and how can the connections within the data be mined to support caption generation without relying on more expensive additional annotation? Furthermore, given the rapid development and prevalence of large-scale pre-trained models in deep learning, how can the knowledge embedded in these models be effectively leveraged for this specific task?

To address these problems, this thesis proposes the following solutions.

For internal knowledge mining, this thesis proposes support-set-based visual representation enhancement, which better aligns visual content and linguistic information in a cross-modal semantic space and strengthens the visual representation during learning. By constructing a support set and establishing a flexible mapping onto it, the model's learning process is optimized, yielding more versatile and richer captions.

For mining vision-language knowledge from external pre-trained models, this thesis proposes keyword-assisted video caption generation based on the cross-modal pre-trained model CLIP. It advances visual content understanding while being guided by comprehension of the text modality. By exploiting CLIP's unified vision-language semantic space and the rich knowledge stored in the large pre-trained model, keywords describing the visual information are extracted to guide caption generation.

Finally, detailed experiments on two video captioning datasets demonstrate the effectiveness and superiority of the proposed methods.
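The abstract describes the support-set mechanism only at a high level. The following is a minimal sketch of one plausible reading, assuming the support set is a bank of reference video embeddings in the shared cross-modal space and that the "flexible mapping" is a soft attention over that bank; the function name support_set_enhance, the temperature, and the residual blend are illustrative assumptions, not the thesis's actual design.

```python
import torch
import torch.nn.functional as F

def support_set_enhance(video_feats: torch.Tensor,
                        support_feats: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    """Enhance each video representation as an attention-weighted
    mixture over a support set of reference features.

    video_feats:   (B, D) query video embeddings
    support_feats: (S, D) support-set embeddings, e.g. other
                   training videos in the cross-modal space
    """
    # Cosine-similarity attention between queries and the support set.
    q = F.normalize(video_feats, dim=-1)
    k = F.normalize(support_feats, dim=-1)
    attn = torch.softmax(q @ k.t() / temperature, dim=-1)  # (B, S)

    # Soft mapping: reconstruct each query from the support set,
    # then blend the reconstruction with the original feature.
    reconstructed = attn @ support_feats                   # (B, D)
    return video_feats + reconstructed
```

The residual form keeps the original visual evidence while mixing in semantically related examples, which is one simple way a support set could enrich the representation fed to the caption decoder.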
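Likewise, the abstract does not detail how keywords are obtained from CLIP. Below is a minimal sketch, assuming the Hugging Face transformers checkpoint openai/clip-vit-base-patch32 and a hypothetical candidate vocabulary CANDIDATE_KEYWORDS (in practice such a vocabulary could be mined from the training captions); the thesis's actual keyword pipeline may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical candidate vocabulary, stand-in for keywords mined
# from the target dataset's training captions.
CANDIDATE_KEYWORDS = ["dog", "guitar", "cooking", "soccer", "dancing"]

def extract_keywords(frames: list[Image.Image], top_k: int = 3) -> list[str]:
    """Score candidate keywords against sampled video frames with CLIP
    and return the top-k matches to guide the caption decoder."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    inputs = processor(text=CANDIDATE_KEYWORDS, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image: (num_frames, num_keywords) similarity scores
        logits = model(**inputs).logits_per_image

    # Average per-frame scores over the sampled frames, then take top-k.
    scores = logits.mean(dim=0)
    top = scores.topk(top_k).indices.tolist()
    return [CANDIDATE_KEYWORDS[i] for i in top]
```

The extracted keywords would then serve as textual guidance, for example prepended as a prompt to the language generator's input.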
Keywords/Search Tags: Cross-Modality, Video Understanding, Captioning, Support Set, Pre-training