Research On Multimodal Audio-Visual Separation Model Based On Attention Mechanism

Posted on: 2024-05-28 | Degree: Master | Type: Thesis | Country: China | Candidate: Y T Zhang | Full Text: PDF | GTID: 2568307160955549 | Subject: Computer Science and Technology

Abstract/Summary:

Audio source separation is the task of recovering the individual sound signals from a mixed audio source containing multiple sounds. It supports practical applications such as speech separation and automated music processing, and is therefore of real research significance. Existing audio source separation models can be divided into traditional models and deep learning-based models, but both kinds have shortcomings. Some models use only audio information and ignore visual information, which wastes available data. Some have relatively simple networks that cannot extract sufficient feature information. Some are weakly robust to noise and cannot focus on the more important features. Others ignore the differences between types of features and fuse them directly, which raises a semantic gap problem. To address these shortcomings, this thesis constructs multimodal audio-visual separation models based on attention mechanisms to better solve the audio source separation problem.

To address the problems of overly simple networks and weak noise robustness, this thesis proposes a multimodal audio-visual separation model based on a single-channel attention mechanism. The model takes videos with accompanying audio as its dataset and designs two different networks to extract feature information from the visual and audio modalities, avoiding the waste of data. In the visual analysis module, channel attention and spatial attention are introduced and connected in series to build a hybrid domain attention mechanism (sketched below), which improves the model's robustness to noise and lets it focus on the more important features in the data while ignoring other distracting information. A full-scale skip connection structure is designed in the audio information module to better connect shallow features with deep features, so that the model obtains sufficient feature information.

To address the problem of directly fusing different types of feature information, this thesis proposes a multimodal audio-visual separation model based on a dual-channel attention mechanism. The model again uses videos with audio as the dataset, obtaining visual and audio features from a multimodal perspective, and connects channel attention and spatial attention in series in the visual channel to build a hybrid domain attention mechanism that strengthens the model's robustness to noise. In the audio channel's attention module, an attention gating mechanism (see the gating sketch below) is designed to dynamically fuse high-level and low-level features through learned weights, avoiding the semantic gap caused by direct fusion and reducing the noise generated during training. The output spectrograms and quantitative experiments on the MUSIC-21 and AVE datasets show that this model separates audio better than previous audio separation models. An audio-visual separation system is also built to apply the model in practical engineering.
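The abstract names but does not specify the hybrid domain attention mechanism. Below is a minimal PyTorch sketch, assuming a CBAM-style design in which channel attention is followed serially by spatial attention; the class names, the reduction ratio, and the 7x7 spatial kernel are illustrative choices, not the thesis's own.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze global context per channel, then reweight channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    """Reweight spatial positions using channel-pooled statistics."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stats))

class HybridDomainAttention(nn.Module):
    """Channel attention followed serially by spatial attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))
```

The serial order (channel first, then spatial) matches the abstract's description; attending over channels before positions lets the spatial map operate on already-reweighted features.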
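The full-scale skip connection structure in the audio module is likewise only named. Here is a sketch under the assumption of a UNet 3+-style aggregation, where every encoder scale is resized to one decoder stage's resolution, projected to a shared width, and concatenated; FullScaleFusion, width, and target_hw are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleFusion(nn.Module):
    """One decoder stage that receives skips from every encoder scale:
    each feature map is resized to the target resolution, projected to
    a shared channel width, and concatenated, so shallow and deep
    features meet at full scale."""
    def __init__(self, encoder_channels: list[int], width: int = 64):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(c, width, kernel_size=3, padding=1)
            for c in encoder_channels
        )

    def forward(self, feats: list[torch.Tensor], target_hw: tuple[int, int]) -> torch.Tensor:
        resized = [
            proj(F.interpolate(f, size=target_hw, mode="bilinear",
                               align_corners=False))
            for proj, f in zip(self.proj, feats)
        ]
        return torch.cat(resized, dim=1)  # (batch, width * num_scales, H, W)
```

With encoder outputs of, say, 64, 128, and 256 channels, FullScaleFusion([64, 128, 256]) yields a 192-channel map at the requested resolution, mixing shallow detail with deep semantics in a single stage.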
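The attention gating mechanism in the dual-channel model's audio branch is also described only at a high level. One plausible realization, assumed here, is an Attention-U-Net-style gate: a weight map derived jointly from the high-level and low-level features scales the low-level skip before fusion, so fusion is weighted rather than direct. AttentionGate and its channel arguments are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Weight low-level skip features with a map computed from the
    high-level features, so fusion is gated rather than a direct
    concatenation (one way to narrow the semantic gap)."""
    def __init__(self, low_ch: int, high_ch: int, inter_ch: int):
        super().__init__()
        self.w_low = nn.Conv2d(low_ch, inter_ch, kernel_size=1)
        self.w_high = nn.Conv2d(high_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Bring the coarser high-level map up to the skip's resolution.
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        attn = torch.sigmoid(self.psi(F.relu(self.w_low(low) + self.w_high(high))))
        return low * attn  # low-level detail kept where context agrees
```

The gated skip would then be concatenated with the upsampled deep features, e.g. torch.cat([gate(low, high), upsampled_high], dim=1), so high-level context weights the low-level detail instead of overwriting it.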
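Finally, the abstract reports output spectrograms but does not state the separation head. A common design, assumed here rather than taken from the thesis, predicts a time-frequency mask over the mixture spectrogram; the sketch applies such a mask and reconstructs a waveform by reusing the mixture's phase. The function name and the STFT settings (n_fft=1024, hop=256) are placeholders.

```python
import torch

def separate_with_mask(mixture: torch.Tensor, mask: torch.Tensor,
                       n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Apply a predicted time-frequency mask to a mixture waveform.

    mixture: (batch, samples) waveform of the mixed audio
    mask:    (batch, n_fft//2 + 1, frames) values in [0, 1] from the network
    """
    window = torch.hann_window(n_fft, device=mixture.device)
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)        # (batch, freq, frames)
    masked = spec * mask                          # keep the mixture's phase
    return torch.istft(masked, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])
```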
Keywords/Search Tags: multimodality, sound source separation, attention mechanism, audio-visual separation, feature information, spectrogram