
Research On Video Classification Method Based On Multimodal Temporal Information Modeling And Fusion

Posted on: 2024-05-07    Degree: Master    Type: Thesis
Country: China    Candidate: M J Qi    Full Text: PDF
GTID: 2568307076974779    Subject: Master of Electronic Information (Professional Degree)
Abstract/Summary:
With the continuous development of the Internet and the iteration of network technology, video has become one of the primary ways people consume information, relax, work, and study. As video content grows exponentially, efficient classification and retrieval have become urgent problems. However, since most people do not label their videos when uploading them, manual video classification and retrieval are extremely cumbersome tasks. Designing a video classification method that automatically predicts the category of a video is therefore particularly necessary.

Current video classification methods fall into two types: single-modality and multi-modality. Multi-modality methods are superior to single-modality ones because they exploit more information and thus better reflect the content of a video. Nevertheless, most multi-modality methods suffer from two shortcomings. First, they pay insufficient attention to multi-modal temporal information. Since the temporal streams of different modalities are usually correlated, failing to fully model the temporal relationship between them means that this relationship cannot be exploited and the temporal context cannot be correctly understood, leading to an insufficient understanding of the multi-modal information and, in turn, to incorrect classification decisions. Second, their fusion methods are too simple: they use only feature-based early fusion or decision-based late fusion, which cannot fully explore the interactions between modalities, and they combine the feature vectors of different modalities only by simple concatenation, addition, or multiplication, which cannot capture deeper modality representations or reflect the complementarity, consistency, and connections between modalities, and thus cannot fully reflect the content of the video.

As a solution, this thesis proposes a video classification method based on multi-modal temporal information modeling and fusion. The method first supplements the features of each modality with feature encodings, producing feature vectors that encode several types of information, including position information, modality information, and verb-noun markers. An audio-visual Transformer then models the multi-modal temporal information. Afterward, dedicated fusion modules combine the features of the different modalities, extracting valuable information within each modality and complementary information across modalities. Finally, the model is trained with a contrastive learning loss to enhance its performance. Experiments on the public datasets EPIC-KITCHENS-100 and EGTEA demonstrate the effectiveness and novelty of the approach.
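The abstract describes supplementing each modality's features with encodings of position and modality identity before the audio-visual Transformer. The exact encoding scheme is not specified in the abstract; the following NumPy sketch shows one common way such encodings are combined (all dimensions, and the use of additive embeddings, are illustrative assumptions, not the thesis's actual design):

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 16          # sequence length and feature dimension (illustrative values)
num_modalities = 2    # visual and audio

# Hypothetical per-frame features for the two modalities.
visual = rng.normal(size=(T, D))
audio = rng.normal(size=(T, D))

# These would be learnable in practice; random here for illustration.
pos_emb = rng.normal(size=(T, D))               # encodes temporal position
mod_emb = rng.normal(size=(num_modalities, D))  # encodes which modality a token came from

# Each token is enriched with position and modality information so a shared
# Transformer can tell tokens apart by time step and by modality.
visual_tokens = visual + pos_emb + mod_emb[0]
audio_tokens = audio + pos_emb + mod_emb[1]

# Concatenate along the time axis into one multimodal token sequence.
tokens = np.concatenate([visual_tokens, audio_tokens], axis=0)
print(tokens.shape)  # (16, 16)
```

Additive embeddings keep the token dimension fixed, which lets a single Transformer attend jointly over visual and audio tokens, one plausible way to model cross-modal temporal relationships.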
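The abstract states the model is trained with a contrastive learning loss but does not specify which one. A widely used choice for paired audio-visual embeddings is the symmetric InfoNCE loss, sketched below in NumPy (the function name, temperature value, and batch sizes are assumptions for illustration):

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss over two batches of paired embeddings.

    z_a, z_b: (N, D) arrays where row i of z_a and row i of z_b form a
    positive pair; all other cross pairs act as negatives.
    """
    # L2-normalise so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    logits = z_a @ z_b.T / temperature  # (N, N); positives on the diagonal
    labels = np.arange(len(z_a))

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the a->b and b->a directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
aligned = info_nce_loss(z, z)                          # perfectly matched pairs
random_pairs = info_nce_loss(z, rng.normal(size=(4, 8)))  # unrelated pairs
```

Matched pairs yield a much lower loss than unrelated ones, which is the pressure that pulls corresponding audio and visual representations together during training.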
Keywords/Search Tags:video classification, multimodal temporal information modeling, multi-modality information fusion, transformer, contrastive learning