
Transformer-Based Audio-Visual Event Localization

Posted on: 2024-02-05 | Degree: Master | Type: Thesis
Country: China | Candidate: Z H Qi | Full Text: PDF
GTID: 2568307079471854 | Subject: Electronic information
Abstract/Summary:
With the rapid development of mobile media technology, short videos rich in audio-visual information have come into widespread public view. Using deep learning to perform audio-visual event localization has significant economic value for extracting video content, locating the main action intervals of a video, and recommending content to users accurately. At the same time, compared with single-modal video understanding tasks, representing video resources with joint audio-visual information is more challenging and helps move artificial intelligence toward general intelligence.

In recent years, more and more researchers have observed that the Transformer can capture global attention in video tasks, a natural advantage for resources such as videos whose temporal actions are strongly correlated. However, the Transformer also has shortcomings that are difficult to overcome. First, its computational complexity grows quadratically with the length of the input sequence, which incurs expensive computational costs. Second, when modeling visual data with the Transformer, small targets in the video are hard to capture because the architecture lacks local receptive fields and similar structures, so key actions in the video may be missed. To address these issues, the main work of this thesis is as follows:

(1) The thesis proposes a Transformer network based on audio-visual fusion for the audio-visual event localization task. It uses pre-trained convolutional neural networks to extract features from each modality, then applies window-based self-attention and cross-modal attention in stages to extract high-level representations of the video, first within each modality and then across modalities; with different decoders, the structure can be applied to other multi-modal computer vision tasks.

(2) To overcome the Transformer's insufficient local receptive field on video resources, the thesis introduces a cross-modal Transformer structure into a convolutional neural network and proposes a convolutional-Transformer audio-visual event localization network. By inserting a Transformer-based cross-modal attention module between different stages of a residual neural network, it improves the ability to capture local features across modalities in multi-modal resources.

For the above work, the thesis conducted extensive experiments on the AVE dataset, obtaining 77.9% and 78.7% accuracy, respectively, in the supervised setting, which are highly competitive results. In the ablation study of the convolutional-Transformer network, the thesis additionally introduced the Kinetics dataset into training, raising the result to 79.9%. The thesis also carried out weakly supervised experiments and several further ablation experiments, discussing in detail the design ideas, principles, and roles of the different modules and demonstrating the effectiveness of the structure.
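Both contributions rely on a Transformer-style cross-modal attention module in which each modality's features act as queries over the other modality. The following PyTorch sketch shows one plausible form of such a block, purely as an illustration; the class name, feature dimension, head count, and residual layout are assumptions made for the example and are not taken from the thesis.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention block (not the thesis's exact module).

    Audio features query visual features and vice versa, so each modality's
    representation is refined with information from the other.
    """

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # d_model and n_heads are assumed values for the sketch.
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio, visual):
        # audio:  (batch, T, d_model) per-segment audio features
        # visual: (batch, T, d_model) per-segment visual features
        a_att, _ = self.a2v(query=audio, key=visual, value=visual)
        v_att, _ = self.v2a(query=visual, key=audio, value=audio)
        # Residual connections keep each modality's own information.
        audio = self.norm_a(audio + a_att)
        visual = self.norm_v(visual + v_att)
        return audio, visual

if __name__ == "__main__":
    # Toy example: 10 one-second segments, 256-d features per modality.
    block = CrossModalAttention()
    a = torch.randn(2, 10, 256)
    v = torch.randn(2, 10, 256)
    a_out, v_out = block(a, v)
    print(a_out.shape, v_out.shape)  # torch.Size([2, 10, 256]) each
```

In the convolutional-Transformer variant described in (2), a block of this kind would sit between residual-network stages, exchanging information between audio and visual feature streams before the convolutional processing continues; the exact placement and feature shapes in the thesis's network are not specified here.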
Keywords/Search Tags: Deep Learning, Transformer, Attention Mechanism, Audio-Visual Event Localization, Computer Vision