Research On Multimodal Speech Separation Based On Face Video And Audio

Posted on:2024-04-25

Degree:Master

Type:Thesis

Country:China

Candidate:P W Jiang

GTID:2558306920454104

Subject:Electronic information

Abstract/Summary:

Speech is the main channel of information transmission in human society. In real life, speech may be corrupted by other voices or by noise. The purpose of speech separation in speech signal processing is therefore to recover a single signal from a mixture. Speech separation is a front-end processing step for speech signals and plays an important role in speech recognition, natural speech understanding, and intelligent interaction. Traditional speech separation relies on signal processing and statistical methods and recovers clean speech from a single modality. With the development of multimedia technology, speech and video signals are often captured together, and using visual signals to assist audio signals in speech separation has become a new research field. Because video acquisition is not affected by the real acoustic environment, and because facial lip movement is strongly correlated with the speech signal, audiovisual speech separation, which fuses visual and audio information across modalities, has become a new research focus. Building on existing methods, this thesis proposes multimodal fusion audiovisual separation methods for two scenarios. The main research contents are as follows.

First, the visual features used by existing methods have poor robustness, and their audiovisual fusion strategies are limited to a single simple form. This thesis fully considers the interrelation between visual features and audio features. Using a multi-head attention mechanism combined with the Farneback algorithm and a U-Net network, it proposes a cross-modal fusion optical-flow audiovisual separation model. The Farneback algorithm and the lightweight network ShuffleNet v2 extract motion features and lip features, respectively. An affine transformation is then applied to the motion features and lip features, and the visual features are obtained through a temporal convolution module. To make full use of the visual information, a multi-head attention mechanism is adopted in feature fusion: the visual features and audio features are fused across modalities to obtain fused audiovisual features, which are finally separated by the U-Net separation network. Experiments were carried out on the VoxCeleb2 dataset using PESQ, STOI, and SDR as evaluation metrics. The results show that, compared with an audio-only speech separation network and a feature-concatenation audiovisual separation network, the proposed method improves performance by 2.23 dB and 1.68 dB, respectively.

Second, for multi-speaker speech separation, audio-only separation performs poorly in real scenes, and time-frequency audiovisual speech separation suffers from phase mismatch. Based on the Conv-TasNet speech separation network, a visual encoder is added alongside the audio encoder to improve the efficiency and accuracy of the model. When a TCN faces very long speech sequences, the receptive field of its one-dimensional convolutions is smaller than the sequence length, so it cannot model the complete sequence. Therefore, inspired by the dual-path recurrent neural network (DPRNN) used in audio-only speech separation, DPRNN replaces the TCN separation network, and a time-domain audiovisual speech separation model based on Conv-TasNet is constructed. Facial key points are detected with the Dlib library to obtain the lip information. The audio encoder then encodes the speech to obtain audio features, and the visual encoder encodes the lip information to obtain lip features. To fully consider the correlation between visual and audio information, a cross-modal fusion scheme is used for audiovisual fusion. The fused audiovisual features are fed to the DPRNN separation network, which outputs the masks. Finally, each mask is multiplied by the audio features, and the decoder produces the separated speech of each speaker. Experiments were carried out on the VoxCeleb2 dataset using PESQ, STOI, and SDR as evaluation metrics. The results show that, compared with the Conv-TasNet speech separation network and a time-frequency audiovisual speech separation network, the proposed method improves performance by 1.99 dB and 1.44 dB, respectively, for mixtures of two speakers. For mixtures of three speakers, SDR increases by 2.21 dB and 1.58 dB, respectively. For mixtures of four speakers, SDR increases by 2.31 dB and 1.69 dB, respectively.

These results indicate that using cross-modal attention for feature fusion makes better use of the correlation between modalities, and that adding lip-movement features effectively improves the robustness of the video features and the separation quality. In addition, for the multi-speaker scenario, the time-domain audiovisual speech separation method based on Conv-TasNet addresses both the poor performance of audio-only separation in real scenes and the phase mismatch problem of time-frequency audiovisual speech separation.
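
The abstract describes cross-modal fusion with multi-head attention but does not give its exact form. The following is only a minimal NumPy sketch of one common arrangement, not the thesis's implementation. It assumes that audio frames act as queries and visual frames as keys and values, that random matrices stand in for learned projection weights, and that the attended visual context is added back to the audio features through a residual connection. The function name, feature dimension, frame counts, and number of heads are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(audio, visual, num_heads, rng):
    """Fuse audio (T_a, d) with visual (T_v, d) features.

    Assumption: audio frames are queries, visual frames are keys/values.
    Returns the fused features (T_a, d) and the attention weights
    (num_heads, T_a, T_v).
    """
    t_a, d = audio.shape
    assert visual.shape[1] == d, "both modalities must share feature size d"
    assert d % num_heads == 0, "d must be divisible by num_heads"
    d_head = d // num_heads

    # Random projections stand in for learned weights (illustration only).
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    def split_heads(x):
        # (T, d) -> (num_heads, T, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(audio @ w_q)
    k = split_heads(visual @ w_k)
    v = split_heads(visual @ w_v)

    # Scaled dot-product attention: every audio frame attends over all
    # visual frames, independently in each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    context = weights @ v  # (num_heads, T_a, d_head)

    # Merge the heads back to (T_a, d) and add a residual connection.
    merged = context.transpose(1, 0, 2).reshape(t_a, d)
    fused = audio + merged @ w_o
    return fused, weights


rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((100, 64))   # hypothetical: 100 audio frames
visual_feats = rng.standard_normal((25, 64))   # hypothetical: 25 video frames
fused, weights = cross_modal_attention(audio_feats, visual_feats, 4, rng)
print(fused.shape, weights.shape)
```

Because the queries come from the audio stream, the fused output keeps the audio time resolution, so the two modalities do not need to be resampled to a common frame rate before fusion. A trained model would replace the random matrices with learned parameters.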

Keywords/Search Tags:

speech separation, audiovisual integration, cross-modal attention, optical flow, U-Net

Related items

1	Cross-modal Audiovisual Integration Research On Influencing Factors
2	Study On Cross-modal Speech Recognition Methods With Fusion Lipreading
3	Research On Speaker Speech Separation In The Scene Of Wearing A Mask
4	Multi-speaker Speech Separation Based On Deep Learning
5	Speech Enhancement And Separation Based On Deep Neural Networks
6	Research On Image-Text Cross-Modal Matching Based On Attention Mechanism
7	Research On Cross-modal Retrieval Of Speech And Image Based On Deep Neural Network
8	Research On Low Intrusive Multi-modal Speech Separation Methods
9	Research On Multi-modal Word Segmentation Method Integrating Speech Features
10	Research On Multi-modal Speech Separation Based On Audio-visual Combination