
Research On Video Description Method Based On Feature Enhancement And Fusion Strategy

Posted on: 2024-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Bai
Full Text: PDF
GTID: 2568307097457504
Subject: Communication and Information System
Abstract/Summary:
Video is one of the main carriers of information in today's society. As a multimodal medium, it conveys richer and more diverse information than images or text. Video description aims to convert a video into natural-language sentences that describe its content, a technology with broad application prospects in human-computer interaction, assistance for visually impaired people, and video retrieval. Existing video description methods suffer from inaccurate localization and recognition of features in key regions of the video, insufficient feature fusion, and weak connections between words, so the generated sentences often fail to describe the video content correctly. To address these problems, this thesis proposes a video description method based on feature enhancement and a fusion strategy. The main research contents are as follows:

(1) To improve the model's ability to locate key regions of the video and the quality of extracted static object features, this thesis proposes VFE-4, an encoder based on feature enhancement. VFE-4 uses a dual attention module, built from channel attention and spatial attention, to model correlations between channels, improving the static feature extraction network's ability to capture important regional features. A feature enhancement module is also integrated, which uses local and global features to provide correct detail guidance for the model, amplifying the feature differences between similar objects and improving the accuracy of the encoded features of the target subject. Experimental results show that the quality of the static video features extracted by VFE-4 is significantly improved, which helps the decoding network generate more accurate sentences. Compared with the baseline model, VFE-4 improves the average score by 1.1% and 0.6% on the MSVD and MSR-VTT datasets, respectively.

(2) To address insufficient feature fusion and weak connections between words, this thesis builds on the feature-enhancement encoder and adopts three fusion strategies, using a spatial module and a temporal module to fully integrate features from different modalities, establish the correlation between the target subject and its actions, and improve the fusion quality of the overall features. In addition, so that the decoder not only attends more to important words but also makes full use of the feature information of previously generated words, this thesis integrates a text attention mechanism (TA) into the decoder of the STC model, enabling the model to predict important words that better represent the video context. Experimental results show that the proposed STC1-TA model achieves fuller feature fusion, a clearer relationship between the target subject and its actions, predicted words that better represent the video context, and generated descriptions closer to the reference sentences. Compared with recent optimized versions of the baseline model, STC1-TA improves the average score by 1.2% and 1.5% on the MSVD and MSR-VTT datasets, and it outperforms most mainstream models in the same field on the evaluation metrics.
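The abstract does not give the internal structure of the dual attention module in (1). Assuming an SE-style channel gate and a CBAM-style spatial gate (common designs for channel and spatial attention; the weights and shapes here are illustrative, not taken from the thesis), a minimal NumPy sketch of how such gating reweights a feature map might look like:

```python
import numpy as np

def channel_attention(x, reduction=4):
    # x: (C, H, W) feature map. SE-style: squeeze spatially, excite per channel.
    C = x.shape[0]
    z = x.mean(axis=(1, 2))                                   # squeeze: global average pool -> (C,)
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((C // reduction, C)) * 0.1       # excitation MLP (illustrative random weights)
    W2 = rng.standard_normal((C, C // reduction)) * 0.1
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0))))   # sigmoid gate per channel, in (0, 1)
    return x * s[:, None, None]                               # rescale each channel

def spatial_attention(x):
    # CBAM-style: pool over channels, then gate each spatial location.
    avg = x.mean(axis=0)                                      # (H, W) average-pooled map
    mx = x.max(axis=0)                                        # (H, W) max-pooled map
    m = 1.0 / (1.0 + np.exp(-(avg + mx)))                     # simplified gate (the usual conv is omitted)
    return x * m[None, :, :]                                  # rescale each spatial position

x = np.random.default_rng(1).standard_normal((8, 4, 4))       # toy feature map: 8 channels, 4x4
y = spatial_attention(channel_attention(x))                   # dual attention: channel gate, then spatial gate
print(y.shape)  # (8, 4, 4) -- same shape, features reweighted
```

Because both gates lie in (0, 1), the module can only emphasize or suppress existing responses; it never changes the shape of the feature map, which is what lets it drop into an existing extraction backbone.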
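The thesis does not specify how the TA mechanism in (2) scores previously generated words. As one plausible reading, a dot-product attention over the embeddings of the word history would let the decoder reuse earlier word features when predicting the next word; the function name and shapes below are assumptions for illustration:

```python
import numpy as np

def text_attention(h_t, prev_embs):
    # h_t: (d,) current decoder hidden state
    # prev_embs: (T, d) embeddings of the words generated so far
    scores = prev_embs @ h_t                  # relevance of each past word to the current step
    w = np.exp(scores - scores.max())
    w = w / w.sum()                           # softmax: attention weights over the word history
    ctx = w @ prev_embs                       # (d,) weighted summary of previously generated words
    return ctx, w

d = 6
rng = np.random.default_rng(0)
h_t = rng.standard_normal(d)                  # toy hidden state
prev = rng.standard_normal((3, d))            # three words generated so far
ctx, w = text_attention(h_t, prev)
print(round(float(w.sum()), 6))  # 1.0 (weights form a distribution over past words)
```

The context vector `ctx` would then be fused with `h_t` before the output projection, so the next-word prediction is conditioned on both the video features and the word history rather than on the hidden state alone.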
Keywords/Search Tags: Video Caption, Encoder-Decoder, Feature Enhancement, Fusion Strategy, Text Attention Mechanism