
Research On Speech Separation Based On Multi-Modality Fusion

Posted on: 2023-07-19    Degree: Master    Type: Thesis
Country: China    Candidate: C L Wang    Full Text: PDF
GTID: 2568307061453744    Subject: Computer Science and Technology
Abstract/Summary:
The goal of speech separation is to extract the target speech from a mixture containing multiple speakers and interfering signals. It is an active research topic in signal processing, with a wide range of applications in video conferencing, intelligent human-computer interaction, and digital hearing aids. At present, most existing speech separation models are trained on English datasets and do not transfer well to Chinese scenarios. Moreover, speech separation has long been treated as an audio-only problem. However, because observing a speaker's facial motion aids speech perception, introducing visual modality information improves separation performance, especially in multi-talker environments. Yet most existing multi-modality fusion methods rely on simple concatenation and do not fully exploit the correlation between modalities. To address these problems, this thesis studies joint audio-visual speech separation in both the time-frequency domain and the time domain. The main contributions are as follows:

(1) A novel Chinese multi-modality dataset, MOOC-Speech, is constructed, together with an automatic processing tool for building the dataset. The videos are collected from teaching videos on the "MOOC" website. MOOC-Speech covers both the auditory and the visual modality; the visual modality includes face images and lip movement. In total, MOOC-Speech contains roughly 310 hours of video segments from 108 distinct speakers (57 male and 51 female) spanning a wide variety of accents, which exceeds the scale of most existing speech datasets and offers realistic scenes as well as audio and modality diversity. Results on classical speech separation models (DPCL, uPIT, Conv-TasNet, PixelPlayer), together with comparisons against public datasets, demonstrate its applicability and effectiveness for the speech separation task. Meanwhile, the scale of MOOC-Speech facilitates training speaker-independent models (trained once, applied to any speaker).

(2) The time-frequency-domain speech separation model is based on U-Net. It comprises a lip motion analysis network, a facial feature extraction network, and a speech separation network, with lip motion and face images serving as the visual information. A cross-modal correspondence (CMC) loss is proposed to strengthen the correlation between the audio and visual information. Compared with using only a static face as visual information, adding lip motion yields a 14.6% improvement in SDR, and the CMC loss further improves the result by 1.4%. In addition, the model achieves speaker-independent speech separation.
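The abstract does not spell out the form of the CMC loss; the minimal sketch below assumes a contrastive formulation that encourages matching audio and visual embeddings of the same utterance to be similar and mismatched pairs to be dissimilar. All function names, shapes, and the temperature value are illustrative assumptions, not the thesis's exact objective.

```python
# Hypothetical sketch of a cross-modal correspondence (CMC) loss.
# Assumes a contrastive objective over paired audio/visual embeddings;
# names, shapes, and the temperature are illustrative.
import torch
import torch.nn.functional as F


def cmc_loss(audio_emb: torch.Tensor, visual_emb: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """audio_emb, visual_emb: (batch, dim) embeddings of the same utterances."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: each audio clip should match its own video, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example: 8 training pairs with 256-dimensional embeddings.
loss = cmc_loss(torch.randn(8, 256), torch.randn(8, 256))
```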
(3) For speaker-independent speech separation in the time domain, the audio-only separation model based on Conv-TasNet is improved by adding lip motion as visual assistance, and an attention-based cross-modal feature fusion method is proposed. The auditory features are used to query the most relevant visual features, so that the model learns a weight distribution over the importance of the different modalities and effectively exploits the coupling between auditory and visual information to improve separation performance. Experimental results show that, compared with the audio-only Conv-TasNet baseline, the multi-modality model improves SI-SDR by 5.14%, and the cross-modal attention fusion improves SI-SDR by a further 3.3% compared with simple concatenation-based feature fusion.
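As a rough illustration of the attention-based fusion described above, the following sketch uses auditory features as queries over the visual feature sequence via standard multi-head cross-attention; the module names, feature dimensions, and the concatenation-plus-projection output step are assumptions, not the thesis's exact architecture.

```python
# Hypothetical sketch of attention-based cross-modal fusion:
# audio frames attend over visual frames to pick the most relevant ones,
# then the attended visual features are fused back into the audio stream.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, audio_dim: int = 256, visual_dim: int = 512,
                 num_heads: int = 4):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, audio_dim)   # align feature dimensions
        self.attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(2 * audio_dim, audio_dim)   # fuse audio + attended visual

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        """audio_feat: (batch, T_a, audio_dim); visual_feat: (batch, T_v, visual_dim)."""
        v = self.visual_proj(visual_feat)
        # Audio queries attend over the visual sequence to find the most relevant frames.
        attended, _ = self.attn(query=audio_feat, key=v, value=v)
        return self.out_proj(torch.cat([audio_feat, attended], dim=-1))


# Example: 100 audio frames attending over 25 video frames, batch of 2.
fusion = CrossModalAttentionFusion()
fused = fusion(torch.randn(2, 100, 256), torch.randn(2, 25, 512))
```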
Keywords/Search Tags: Speech separation, Multi-modality fusion, Multi-modality datasets, Attention mechanism