| Speech is the main channel of information transmission in human society. In real life, speech is often corrupted by competing voices or noise, so the goal of speech separation is to recover a single target signal from a mixture. Speech separation is a front-end step in speech signal processing and plays an important role in speech recognition, natural language understanding and intelligent interaction. Traditional speech separation relies on signal processing and statistical methods and uses a single modality, audio, to recover clean speech. With the development of multimedia technology, speech and video signals are now commonly available together, and using visual signals to assist audio-based speech separation has become a new research field. Because video acquisition is not affected by the acoustic environment, and lip movements are strongly correlated with the speech signal, audiovisual speech separation, which fuses visual and audio information across modalities, has become a new research focus. Building on existing methods, this thesis proposes multimodal-fusion audiovisual separation methods for two scenarios. The main research contents are as follows. First, existing methods rely on visual features with poor robustness and on a single, simple form of audio-visual fusion. This thesis takes the interrelation between visual and audio features fully into account: by combining a multi-head attention mechanism with the Farneback algorithm and a U-Net network, it proposes an optical-flow audiovisual separation model with cross-modal fusion. The Farneback algorithm and the lightweight network ShuffleNet v2 were used to extract motion features and lip features respectively, the motion and lip features were then combined through an affine transformation, and the visual features were obtained through a temporal convolution module. To make full use of the visual information, a multi-head attention mechanism was 
adopted for feature fusion. The visual and audio features were fused across modalities to obtain the fused audiovisual features, which were finally separated by a U-Net separation network. Using PESQ, STOI and SDR as evaluation metrics, experiments were carried out on the VoxCeleb2 dataset. The results show that, compared with an audio-only speech separation network and an audiovisual separation network based on feature concatenation, the proposed method improves performance by 2.23 dB and 1.68 dB, respectively. Second, in multi-speaker speech separation, audio-only separation performs poorly in real scenes, and time-frequency-domain audiovisual speech separation suffers from phase mismatch. Building on the Conv-TasNet speech separation network, a visual encoder is added alongside the audio encoder to improve the efficiency and accuracy of the model. When a TCN processes a very long speech sequence, the receptive field of its one-dimensional convolutions is shorter than the sequence, so the complete sequence cannot be modeled. Therefore, inspired by the dual-path recurrent neural network (DPRNN) used in audio-only speech separation, DPRNN replaces the TCN separation network, and a time-domain audiovisual speech separation model based on Conv-TasNet is constructed. Facial key points are detected with the Dlib library to obtain the lip information. The audio encoder then encodes the speech to obtain the audio features, and the visual encoder encodes the lip information to obtain the lip features. To fully account for the correlation between visual and audio information, a cross-modal fusion scheme is used for audiovisual fusion. The fused audiovisual features are fed to the DPRNN separation network, which outputs the masks. Finally, the masks are multiplied by the audio features, and the decoder reconstructs the separated speech of each speaker. Using PESQ, STOI and SDR as evaluation metrics, the experimental evaluation was 
carried out on the VoxCeleb2 dataset. The results show that, compared with the Conv-TasNet speech separation network and a time-frequency-domain audiovisual speech separation network, the proposed method improves SDR by 1.99 dB and 1.44 dB respectively on two-speaker mixtures, by 2.21 dB and 1.58 dB respectively on three-speaker mixtures, and by 2.31 dB and 1.69 dB respectively on four-speaker mixtures. These results indicate that cross-modal attention for feature fusion makes better use of the correlation between modalities, and that adding lip-motion features effectively improves the robustness of the visual features and the separation quality. In addition, for multi-speaker scenes, the time-domain audiovisual speech separation method based on Conv-TasNet addresses both the poor performance of audio-only separation in real scenes and the phase mismatch problem of time-frequency-domain audiovisual speech separation. |
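
The improvements above are reported in dB of SDR. As a rough illustration of what such a gain measures, the sketch below computes a plain signal-to-distortion ratio, 10·log10(‖s‖² / ‖s − ŝ‖²), and the difference between two systems. This simple definition and the names `sdr_db` and `sdr_improvement_db` are assumptions made for illustration; the thesis may use a different SDR variant (for example a BSS-Eval style measure), and the toy signals are not data from the thesis.

```python
import math


def sdr_db(reference, estimate):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumed definition for illustration; not necessarily the thesis's variant.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement_db(reference, baseline_estimate, proposed_estimate):
    """SDR gain (dB) of a proposed system over a baseline on one utterance."""
    return sdr_db(reference, proposed_estimate) - sdr_db(reference, baseline_estimate)


# Toy signals (hypothetical): the "proposed" estimate has half the error amplitude.
reference = [math.sin(0.1 * n) for n in range(200)]
baseline = [s + 0.10 * math.cos(0.7 * n) for n, s in enumerate(reference)]
proposed = [s + 0.05 * math.cos(0.7 * n) for n, s in enumerate(reference)]
gain = sdr_improvement_db(reference, baseline, proposed)
```

In this toy case the error amplitude is halved, so the error energy drops by a factor of four and `gain` is 10·log10(4), about 6.02 dB. By the same reasoning, a gain of roughly 2 dB corresponds to reducing the residual error energy by a factor of about 1.6.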
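
The cross-modal fusion step described above can be sketched as multi-head scaled dot-product attention in which audio frames query visual frames. This is a minimal sketch under stated assumptions, not the thesis's implementation: it assumes the audio features act as queries and the visual features as keys and values, it omits the learned projection matrices that a trained model would have, and the function names and toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention for a single head."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        output.append([sum(w * v[d] for w, v in zip(weights, values))
                       for d in range(len(values[0]))])
    return output


def cross_modal_attention(audio, visual, num_heads):
    """Each audio frame attends over all visual frames, head by head.

    Assumption: audio = queries, visual = keys and values; no learned projections.
    """
    dim = len(audio[0])
    if dim % num_heads != 0 or len(visual[0]) != dim:
        raise ValueError("feature sizes must match and be divisible by num_heads")
    head_dim = dim // num_heads
    fused = [[] for _ in audio]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = [frame[lo:hi] for frame in audio]
        kv = [frame[lo:hi] for frame in visual]
        # Concatenate this head's output onto each fused audio frame.
        for row, head_out in zip(fused, attend(q, kv, kv)):
            row.extend(head_out)
    return fused


# Toy features (hypothetical): 4 audio frames and 3 visual frames, 4 dimensions each.
audio = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2],
         [0.3, 0.3, 0.1, 0.9], [0.0, 0.7, 0.2, 0.1]]
visual = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8], [0.5, 0.5, 0.9, 0.1]]
fused = cross_modal_attention(audio, visual, num_heads=2)
```

The output has one fused vector per audio frame, so the two streams may run at different frame rates, which is one reason attention is a natural alternative to frame-by-frame feature concatenation. A trained model would typically add learned query, key and value projections and combine the result with the original audio features before the separation network.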