
Research On Natural Language Based Video Retrieval And Localization

Posted on: 2023-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zheng
Full Text: PDF
GTID: 2568306791467884
Subject: Computer Science and Technology

Abstract/Summary:
With the development of multimedia technology and the popularity of mobile smart devices, the number of videos on the Internet is exploding. Against this background, how users can find the videos they need quickly and accurately has gradually attracted the interest of researchers. Current production video retrieval systems work by keyword matching, which is limited by the quality of the keywords: missing keywords, or keywords irrelevant to the video content, degrade retrieval quality, and the keyword form itself cannot express complex requirements and video content as flexibly as natural language can. Moreover, although framed as text-video retrieval, keyword-based retrieval is essentially a text-text unimodal retrieval process, since it involves no understanding of the video content; this is the core reason for its lack of precision and flexibility. To meet users' demands for accuracy and flexibility, research on natural language based video retrieval and localization has begun to attract attention from academia and industry; through cross-modal similarity learning, it helps users find the content they need quickly and accurately. It consists of two subtasks: natural language based video retrieval and natural language based moment localization. Relevant videos are first retrieved from a huge collection of videos, and then the moments that meet the user's needs are located within those videos.

For natural language based video retrieval, a common solution is to obtain a joint video-text embedding space through cross-modal representation learning and then compute video-text similarity in that joint space (the first code sketch after this abstract illustrates the idea). This thesis focuses on learning the video representation in the joint embedding space. The common approach extracts per-frame features offline with a pre-trained 2D-CNN model and learns the video representation on top of them. Considering that a video can be viewed as a composition of fine-grained objects and the relationships between them, and that frame-level information may not be learned at this fine-grained level, this thesis introduces fine-grained object information, namely the objects' visual features and the semantic information of their categories, alongside the frame-level features. A visual-semantic interaction fusion module and a cross-feature interaction enhancement module are designed for the different modalities and different levels of features, so that the two kinds of features complement and enhance each other (see the second sketch below). Extensive experiments on the MSR-VTT and TGIF datasets demonstrate the effectiveness of the model.

For natural language based moment localization, existing methods try to localize the target segment in a single shot, which increases the difficulty of the task. Considering that human localization proceeds from fast to slow and from coarse to fine, this thesis instead localizes the target moment gradually, from coarse to fine. Inspired by this logic, it builds a multi-stage progressive localization network on top of existing localization methods to realize the coarse-to-fine process, passing information learned in earlier stages to later ones through a conditional feature manipulation module and an upsampling connection operation (see the third sketch below). Experiments on three datasets, TACoS, ActivityNet Captions, and Charades-STA, demonstrate the effectiveness of the proposed model and its potential for localizing short moments in long videos.
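As a concrete illustration of the joint-embedding retrieval setup described above, the following is a minimal PyTorch sketch. The module names, feature dimensions, and mean-pooling aggregation are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project video and text features into a shared space and score
    them by cosine similarity (generic baseline, not the thesis's
    exact model)."""
    def __init__(self, video_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, video_dim) per-frame features from a
        # pre-trained 2D-CNN; mean-pool over time as a simple
        # stand-in for a learned aggregation.
        v = self.video_proj(video_feats.mean(dim=1))
        t = self.text_proj(text_feats)       # (B, text_dim) -> (B, joint_dim)
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        # (B, B) similarity matrix: entry (i, j) scores video i against
        # query j; the diagonal holds the matched pairs.
        return v @ t.T
```

In training, a contrastive or triplet loss over such a similarity matrix pulls matched video-text pairs together and pushes mismatched ones apart; at retrieval time, videos are ranked by their similarity to the query embedding.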
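The fusion of frame-level and object-level information could look roughly like the sketch below. The concatenation-based fusion and the cross-attention layer are generic stand-ins for the thesis's visual-semantic interaction fusion and cross-feature interaction enhancement modules; the dimensions and detector class count are assumptions.

```python
import torch
import torch.nn as nn

class ObjectAwareVideoEncoder(nn.Module):
    """Fuse object visual features with their category (semantic)
    embeddings, then let the frame-level stream attend to the
    object-level stream so the two levels complement each other.
    A generic stand-in, not the thesis's exact modules."""
    def __init__(self, frame_dim=2048, obj_dim=2048, num_classes=1600, d=512):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, d)
        self.obj_proj = nn.Linear(obj_dim, d)
        self.cls_embed = nn.Embedding(num_classes, d)  # category semantics
        self.fuse = nn.Linear(2 * d, d)                # visual-semantic fusion
        self.cross_attn = nn.MultiheadAttention(d, num_heads=8,
                                                batch_first=True)

    def forward(self, frames, obj_feats, obj_labels):
        # frames: (B, T, frame_dim); obj_feats: (B, N, obj_dim);
        # obj_labels: (B, N) category ids from an object detector.
        f = self.frame_proj(frames)
        o_vis = self.obj_proj(obj_feats)
        o_sem = self.cls_embed(obj_labels)
        # fuse each object's visual and semantic views
        o = self.fuse(torch.cat([o_vis, o_sem], dim=-1))
        # frame stream queries the object stream (cross-feature
        # interaction), then a residual keeps the frame information
        enhanced, _ = self.cross_attn(query=f, key=o, value=o)
        return (f + enhanced).mean(dim=1)  # (B, d) video embedding
```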
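The coarse-to-fine localization idea can be gestured at with the following sketch, where a FiLM-style modulation stands in for the conditional feature manipulation module and linear interpolation for the upsampling connection; the stage count, dimensions, and boundary-score head are all assumptions rather than the thesis's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveLocalizer(nn.Module):
    """Predict moment boundaries over progressively finer temporal
    grids, passing each stage's features to the next via query-
    conditioned modulation and temporal upsampling. A sketch of the
    coarse-to-fine idea, not the thesis's exact architecture."""
    def __init__(self, d=512, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv1d(d, d, kernel_size=3, padding=1)
            for _ in range(num_stages))
        self.film = nn.ModuleList(
            nn.Linear(d, 2 * d) for _ in range(num_stages))
        self.head = nn.Conv1d(d, 2, kernel_size=1)  # start/end scores

    def forward(self, clip_feats, query_feat):
        # clip_feats: (B, d, T) clip features on the coarsest grid;
        # query_feat: (B, d) sentence embedding.
        x, preds = clip_feats, []
        for stage, film in zip(self.stages, self.film):
            # condition clip features on the query (stands in for the
            # conditional feature manipulation module)
            gamma, beta = film(query_feat).chunk(2, dim=-1)
            x = x * gamma.unsqueeze(-1) + beta.unsqueeze(-1)
            x = F.relu(stage(x))
            preds.append(self.head(x))  # this stage's boundary scores
            # upsample so the next stage refines on a finer grid
            x = F.interpolate(x, scale_factor=2, mode="linear",
                              align_corners=False)
        return preds  # list of (B, 2, T * 2**k) score maps, k = 0,1,2
```

Each stage's prediction can be supervised against the ground-truth moment, so early stages learn a rough location and later stages refine it; this is one plausible reading of "passing learned information from front to back".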
Keywords/Search Tags:language-based video retrieval, fine-grained, feature interaction, moment localization, coarse-to-fine, progressive