
Research On Hierarchical Cross-modal Text-Video Retrieval

Posted on: 2023-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z R Ding
Full Text: PDF
GTID: 2568306917479234
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of the mobile internet and the widespread adoption of digital streaming media, video has become a common data carrier: the barrier to producing it keeps falling and its communicative reach keeps growing, so the volume of video data on the internet is exploding. This massive body of video data is closely bound up with people's work and daily life and carries rich economic and social value, and how to retrieve videos efficiently and accurately has become a focus of both academic research and commercial development. Text-video cross-modal retrieval, which uses a natural-language description to retrieve videos, breaks through the technical bottleneck of traditional keyword-based video retrieval and can express a user's search intent more comprehensively and accurately, making it an effective technical route to the video retrieval problem. Existing text-video cross-modal retrieval methods generally map the entire text and video into a common joint semantic embedding space to produce consistent representations of both modalities, so that text-video similarity can be computed as a vector distance between feature representations.

However, existing text-video cross-modal retrieval methods suffer from two problems. First, they learn only the global features of text and video, so they cannot effectively capture the semantic details of complex long texts, and they ignore the temporal features of video data and the contextual temporal correlation between text and video; this hinders cross-modal similarity representation and similarity computation across the two modalities. Second, audio and video have a natural, frame-accurate temporal correspondence, but audio's auxiliary role in video retrieval has not been explored.

Combining hierarchical features with audio assistance, this thesis proposes two text-video cross-modal retrieval methods that effectively address these two problems. The main work of this thesis is as follows:

First, to address the inability of existing methods to represent the temporal semantic details of the data, a text-video cross-modal retrieval method based on hierarchical semantic matching is proposed. The method uses a semantic parsing toolkit to parse the text into a three-layer semantic graph and performs semantics-aware self-attention graph reasoning on it with a graph convolutional neural network to obtain three-level text features. Three separate convolutional neural networks then produce temporal representations of the video at the same three semantic levels (global events, local actions, and local entities) and align them with the text features, achieving global-to-local multi-level semantic association matching between text and video. Experimental results show that the method better captures the semantic details of text and video, improves the semantic discriminability of cross-modal features, and benefits fine-grained text-video matching, yielding higher cross-modal retrieval accuracy.

Second, to address existing methods' neglect of audio's auxiliary effect on retrieval, a hierarchical text-video cross-modal retrieval method enhanced with audio-spatial attention is proposed. The method extracts noun phrases from transcribed audio and uses them to apply attention weights to the corresponding entity regions in the video's local entity-level features, focusing on the entity-level features that match the audio so as to capture salient video features and eliminate redundant information, thereby improving the fine-grained semantic distinguishability of video features. Experimental results show that the method better captures the main entity information of the video and greatly reduces ambiguity in cross-modal feature matching, achieving better retrieval performance than traditional methods.
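To illustrate the first method's core idea, the following is a minimal sketch (not the thesis's actual model) of multi-level similarity scoring: text and video each contribute one embedding per semantic level (event, action, entity, assumed here to be already projected into the joint space), and retrieval ranks candidates by a weighted sum of per-level cosine similarities.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize a vector so dot products become cosine similarities.
    return x / (np.linalg.norm(x) + eps)

def hierarchical_similarity(text_feats, video_feats, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-level cosine similarities.

    text_feats / video_feats: dicts with hypothetical keys
    'event', 'action', 'entity', each a (d,) embedding assumed to be
    already projected into that level's joint embedding space.
    """
    score = 0.0
    for level, w in zip(("event", "action", "entity"), weights):
        t = l2_normalize(text_feats[level])
        v = l2_normalize(video_feats[level])
        score += w * float(np.dot(t, v))
    return score

# Toy retrieval: rank two candidate videos against one text query.
rng = np.random.default_rng(0)
query = {k: rng.normal(size=64) for k in ("event", "action", "entity")}
video_a = {k: query[k] + 0.1 * rng.normal(size=64) for k in query}          # near match
video_b = {k: rng.normal(size=64) for k in ("event", "action", "entity")}   # unrelated
print(hierarchical_similarity(query, video_a) > hierarchical_similarity(query, video_b))  # True
```

In the thesis the per-level features come from graph reasoning on the text side and level-specific convolutional networks on the video side; this sketch only shows how three-level features combine into one retrieval score.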
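The second method's attention step can likewise be sketched in a simplified form (the names and shapes below are illustrative assumptions, not the thesis's implementation): an embedding of a noun phrase from the transcribed audio scores each detected entity region, and a softmax over those scores weights the region features before pooling.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def audio_guided_pooling(region_feats, phrase_emb, temperature=1.0):
    """Attend over entity-region features using an audio noun-phrase embedding.

    region_feats: (R, d) array, one row per detected entity region (hypothetical).
    phrase_emb:   (d,)  embedding of a noun phrase from the transcribed audio.
    Returns the attention weights and the audio-focused pooled video feature.
    """
    # Cosine scores between the phrase and each region.
    norms = np.linalg.norm(region_feats, axis=1) * np.linalg.norm(phrase_emb) + 1e-8
    scores = region_feats @ phrase_emb / norms
    weights = softmax(scores / temperature)   # sharper focus for small temperature
    pooled = weights @ region_feats           # (d,) entity feature emphasizing the audio's referent
    return weights, pooled

rng = np.random.default_rng(1)
regions = rng.normal(size=(5, 32))
phrase = regions[2] + 0.05 * rng.normal(size=32)  # phrase matches region 2
w, pooled = audio_guided_pooling(regions, phrase, temperature=0.1)
print(int(np.argmax(w)))  # 2
```

Concentrating the attention mass on the regions named in the audio is what suppresses the redundant regions described in the abstract; the temperature controls how hard that suppression is.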
Keywords/Search Tags:Video Retrieval, Cross-modal, Hierarchical, Attention Mechanism