
Research On Deep Multi-modal Enhanced Representation Learning And Its Applications For Micro-videos

Posted on: 2022-12-28    Degree: Master    Type: Thesis
Country: China    Candidate: L J Zhang    Full Text: PDF
GTID: 2558307154975869    Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, micro-videos have become one of the representative forms of user-generated content on social networks. Micro-videos are short in duration yet rich in multi-modal information. Although exploiting this multi-modal information can benefit applications such as micro-video classification, retrieval, and personalized recommendation, doing so is hindered by noisy and incomplete modalities. From the perspective of deep multi-modal enhanced representation learning, this thesis conducts the following two studies.

To address inaccurate textual information and excessive noise, this thesis proposes a multi-modal semantic enhanced representation network for complex event detection in micro-videos. On the one hand, semantic attributes, i.e., adjective-noun pairs, are extracted from the visual modality, and joint representations between these semantic attributes and the text modality are learned by a multi-granular semantic joint representation network. On the other hand, a semantic private representation network is constructed to learn enhanced modality-private representations of micro-videos while maximizing the distinctiveness of the private visual and textual representations. Finally, exploiting the complementarity of different modalities, the joint representation and the modality-private enhanced representations are directly fused to obtain the enhanced representation of the micro-video. The proposed network is evaluated on a micro-video event detection dataset constructed from the Flickr platform, and the effectiveness of the algorithm is verified.

To address missing micro-video modalities and unclear label dependencies, this thesis proposes a multi-modal aggregation attention network for micro-video multi-label classification. First, targeting the commonly occurring missing-modality issue, a multi-modal information aggregation mechanism is designed: multiple modality combinations centered on the visual modality are constructed according to the possible missing-modality scenarios, and an information aggregation module together with an autoencoder-decoder module fully exploits the complementarity and consistency between the other modalities and the visual modality. Second, to better capture label dependencies, an attentive graph neural network is designed to adaptively learn the label correlation matrix and label feature representations. Finally, a cross-modal multi-head attention network is developed to obtain a micro-video representation enriched with category information. Experiments on a large-scale micro-video dataset demonstrate the superior performance of the proposed network compared with state-of-the-art methods.
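To make the fusion step of the first study concrete, the sketch below shows one minimal way such a head could look in PyTorch: a joint encoder over semantic-attribute and text features, two modality-private encoders, and direct concatenation-based fusion before event classification. All module names, dimensions, and the specific layer choices here are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class FusedEventDetector(nn.Module):
    """Hypothetical sketch: joint + private representations, fused by
    concatenation, then classified into complex event categories."""

    def __init__(self, attr_dim, text_dim, vis_dim, hid_dim, n_events):
        super().__init__()
        # Joint representation over semantic attributes (adjective-noun
        # pairs extracted from the visual modality) and the text modality.
        self.joint_enc = nn.Sequential(
            nn.Linear(attr_dim + text_dim, hid_dim), nn.ReLU())
        # Private (modality-specific) encoders for visual and text inputs.
        self.vis_private = nn.Sequential(nn.Linear(vis_dim, hid_dim), nn.ReLU())
        self.txt_private = nn.Sequential(nn.Linear(text_dim, hid_dim), nn.ReLU())
        # Direct fusion of the joint and private representations.
        self.classifier = nn.Linear(3 * hid_dim, n_events)

    def forward(self, attr_feat, text_feat, vis_feat):
        joint = self.joint_enc(torch.cat([attr_feat, text_feat], dim=-1))
        p_vis = self.vis_private(vis_feat)
        p_txt = self.txt_private(text_feat)
        fused = torch.cat([joint, p_vis, p_txt], dim=-1)
        return self.classifier(fused)  # event logits
```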
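For the missing-modality aggregation in the second study, one plausible reading is an encoder that aggregates a visual-centered combination of modalities (zero-filling the absent ones) paired with a decoder trained to reconstruct the complete multi-modal feature. The sketch below follows that reading; the zero-filling strategy and the single shared encoder-decoder are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VisualCenteredAggregator(nn.Module):
    """Hypothetical sketch: aggregate visual-centered modality combinations
    and reconstruct the full feature so missing modalities can be
    compensated during training on complete samples."""

    def __init__(self, vis_dim, txt_dim, aud_dim, hid_dim):
        super().__init__()
        self.txt_dim, self.aud_dim = txt_dim, aud_dim
        # Encoder aggregates whichever modalities are present.
        self.encoder = nn.Sequential(
            nn.Linear(vis_dim + txt_dim + aud_dim, hid_dim), nn.ReLU())
        # Decoder reconstructs all modalities from the shared code.
        self.decoder = nn.Linear(hid_dim, vis_dim + txt_dim + aud_dim)

    def forward(self, vis, txt=None, aud=None):
        b = vis.size(0)
        # Visual is always present; zero-fill any missing modality.
        txt = txt if txt is not None else vis.new_zeros(b, self.txt_dim)
        aud = aud if aud is not None else vis.new_zeros(b, self.aud_dim)
        code = self.encoder(torch.cat([vis, txt, aud], dim=-1))
        recon = self.decoder(code)  # reconstruction target: full features
        return code, recon
```

On complete samples, a reconstruction loss between `recon` and the true concatenated features would push `code` to carry information about all modalities, which is what allows it to stand in when a modality is missing at test time.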
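Finally, the label-dependency and cross-modal attention components of the second study could be sketched as below: label embeddings attend over one another to produce an adaptively learned label correlation matrix (a simple attentive graph layer), after which the labels act as queries into the video representation via standard `nn.MultiheadAttention`. Only that attention module is standard PyTorch; the rest is an illustrative assumption.

```python
import torch
import torch.nn as nn

class AttentiveLabelGraph(nn.Module):
    """Hypothetical sketch: attentively learned label correlations plus
    cross-modal multi-head attention from labels to video tokens."""

    def __init__(self, n_labels, dim, n_heads=4):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(n_labels, dim))
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Labels as queries; video tokens as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, video_tokens):  # video_tokens: (B, T, dim)
        # Adaptively learned label correlation matrix (n_labels x n_labels).
        corr = torch.softmax(
            self.q(self.label_emb) @ self.k(self.label_emb).T
            / self.label_emb.size(-1) ** 0.5, dim=-1)
        # Graph message passing over label features.
        label_feat = corr @ self.v(self.label_emb)
        queries = label_feat.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, video_tokens, video_tokens)
        return self.score(attended).squeeze(-1)  # per-label logits (B, n_labels)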
Keywords/Search Tags: Micro-video, Complex Event Detection, Multi-label Classification, Multi-modal Learning