Speech separation is the task of extracting multiple speech signals, one for each speaker, from a mixture containing two or more voices; it is a typical multi-source separation problem. As an important research direction in the field of speech signal processing, speech separation technology has been widely applied in automatic speech recognition, military interception, home automation, mobile voice communication, and other areas. In recent years, deep learning, with its powerful hierarchical non-linear processing capability, has made it possible to extract the spatiotemporal structure of speech signals and to learn deep abstract features automatically, offering a novel and promising way to solve the speech separation problem. Incorporating deep learning into the speech separation task has therefore become a focus of research in speech signal processing, with great academic value and practical significance. At present, existing studies commonly operate in the time-frequency domain by enhancing and separating the magnitude response while leaving the phase response unaltered, and thus fail to make full use of phase and spatial information to improve separation performance. In this dissertation, we focus on speaker-independent speech separation based on deep neural network (DNN) technology under the framework of supervised learning. The main work and contributions are as follows.

First, to address the fact that existing time-frequency-domain separation methods do not solve the phase-enhancement problem when reconstructing signals, which limits separation performance, a monaural speaker-independent speech separation method employing the complex ideal ratio mask (pcIRM) and a monaural speaker-independent speech separation method based on the shifted real spectrum mask (pSRSM) are proposed and discussed. Both mask targets encode the magnitude and phase information of the speech signal simultaneously. The two targets are estimated with a Y-shaped and an I-shaped bidirectional long short-term memory (BLSTM) network, respectively, trained with the mean-square error between the estimated masks and the target masks. In addition, the utterance-level permutation invariant training (uPIT) strategy is applied to solve the label permutation problem (both the mask target and the training criterion are sketched below). Experimental results show that, compared with existing masking-based methods, the overall performance of the pcIRM and pSRSM methods is better. Specifically, pcIRM outperforms state-of-the-art real-valued network methods in terms of the SDR and PESQ evaluation metrics, while pSRSM achieves performance comparable to pcIRM on opposite-gender speaker mixtures with lower model complexity.

Second, to address the unrecoverable phase information caused by unbalanced training of the real and imaginary branch networks when a real-valued network predicts complex-valued mask targets, an end-to-end monaural speaker-independent speech separation method based on a deep complex U-shaped network (uCSA) is proposed. On the one hand, a novel deep complex U-shaped network is used to estimate the complex ideal ratio mask, and the SI-SNR loss function is applied to model the waveform of the speech signal directly; on the other hand, the signal approximation method is adopted when reconstructing the target speech spectrum. In addition, we reformulate the STFT and iSTFT operations as learnable modules to enable end-to-end training. Experimental results demonstrate that the uCSA method improves the PESQ and STOI evaluation metrics significantly compared with existing real-valued DNN methods, indicating that the magnitude and phase components of the target speech signal can be estimated more effectively.
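To make the mask target and training criterion of the pcIRM method concrete, the following is a minimal NumPy sketch of the complex ideal ratio mask and an utterance-level permutation invariant MSE, written from the standard definitions in the literature; the function names and array shapes are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np
from itertools import permutations

def complex_ideal_ratio_mask(mix_stft, src_stft, eps=1e-8):
    """cIRM M = S / Y, written via real/imaginary parts so that
    M * Y recovers both the magnitude and the phase of the source."""
    yr, yi = mix_stft.real, mix_stft.imag
    sr, si = src_stft.real, src_stft.imag
    denom = yr ** 2 + yi ** 2 + eps
    m_real = (yr * sr + yi * si) / denom
    m_imag = (yr * si - yi * sr) / denom
    return m_real + 1j * m_imag

def upit_mse(est_masks, ref_masks):
    """Utterance-level PIT: score every speaker permutation over the
    whole utterance and train on the minimum-error assignment."""
    n_spk = len(ref_masks)
    losses = []
    for perm in permutations(range(n_spk)):
        loss = np.mean([np.mean(np.abs(est_masks[i] - ref_masks[p]) ** 2)
                        for i, p in enumerate(perm)])
        losses.append(loss)
    return min(losses)
```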
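Likewise, the SI-SNR objective that the uCSA method uses to model the waveform directly can be sketched as follows; this is the standard scale-invariant SNR definition with zero-mean normalization assumed, not the dissertation's exact code.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (in dB) between an estimated and a reference
    waveform; training minimises its negative."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to discard any scaling.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))
```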
Third, to exploit the time-frequency and spatial information that existing DNN-based multi-channel speech separation methods leave unused, an end-to-end multi-channel target speech separation method based on temporal-frequency-spatial feature extraction and beamforming, called cTFS-PMWF, is established. This method integrates complex ideal ratio mask estimation with a parameterized multi-channel Wiener filter (PMWF) adaptive beamformer. First, the temporal-frequency-spatial multi-dimensional features of the multi-channel observation signals are extracted as the network input. Then, given the directional features of the target speech signal as prior knowledge, a deep complex U-shaped network predicts the complex ideal ratio mask of the target speech. Next, the covariance matrices of the noise and the target speech are estimated from the mask. Finally, the filter parameters of the PMWF adaptive beamformer are computed to reconstruct the target speech signal (a sketch of the mask-weighted covariance estimation and the PMWF solution is given below). Experimental results indicate that the cTFS-PMWF method not only achieves better separation performance but also does not depend on the geometry of the microphone array.

Finally, to overcome the reliance of traditional underwater acoustic source separation methods on various artificial modeling assumptions, a supervised monaural underwater acoustic source separation method based on amplitude spectrum approximation (uMSA) is presented, driven by a labeled underwater acoustic source dataset. Specifically, the spectral magnitude mask of the target signal is estimated with a multi-layer bidirectional long short-term memory network, trained with a mean-square-error loss based on signal approximation. In addition, we construct the ShipsEar-2mix dataset for the underwater source separation task and the ShipsEar-org dataset for the underwater target recognition task. A well-trained underwater target recognition model serves as an application-level evaluation indicator for the uMSA method, in addition to the signal-level evaluation metric SDR. Experimental results confirm that the uMSA method separates underwater acoustic sources effectively and improves the quality of underwater sound signals. This work is a first attempt at, and a meaningful exploration of, applying deep learning to the underwater acoustic source separation task.
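As a concrete reference for the cTFS-PMWF pipeline above, the following sketch shows mask-weighted spatial covariance estimation and the standard PMWF solution in the Souden formulation, where beta = 0 reduces to an MVDR-like filter. The array shapes, reference-channel choice, and function names are illustrative assumptions rather than the dissertation's implementation.

```python
import numpy as np

def masked_covariance(stft, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency bin.
    stft: (n_ch, n_freq, n_frames) complex; mask: (n_freq, n_frames)."""
    weighted = stft * mask[None]                  # broadcast mask over channels
    cov = np.einsum('cft,dft->fcd', weighted, stft.conj())
    return cov / (mask.sum(axis=-1)[:, None, None] + eps)

def pmwf_weights(cov_speech, cov_noise, beta=0.0, ref_ch=0, eps=1e-8):
    """PMWF: w_f = (Phi_n^-1 Phi_s u) / (beta + tr(Phi_n^-1 Phi_s))."""
    n_freq, n_ch, _ = cov_speech.shape
    u = np.zeros(n_ch)
    u[ref_ch] = 1.0                               # reference-channel selector
    w = np.zeros((n_freq, n_ch), dtype=complex)
    for f in range(n_freq):
        num = np.linalg.solve(cov_noise[f], cov_speech[f])  # Phi_n^-1 Phi_s
        w[f] = (num @ u) / (beta + np.trace(num).real + eps)
    return w

def beamform(stft, w):
    """Apply w^H per frequency: (n_ch, n_freq, n_frames) -> (n_freq, n_frames)."""
    return np.einsum('fc,cft->ft', w.conj(), stft)
```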
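The uMSA training target and its signal-approximation loss can likewise be written compactly as below; the clipping bound and names are illustrative assumptions rather than the dissertation's settings.

```python
import numpy as np

def spectral_magnitude_mask(mix_mag, src_mag, max_val=1.0, eps=1e-8):
    """Spectral magnitude mask |S| / |Y|, clipped for numerical stability."""
    return np.clip(src_mag / (mix_mag + eps), 0.0, max_val)

def signal_approximation_mse(est_mask, mix_mag, src_mag):
    """Signal-approximation MSE: penalise the masked mixture against the
    target magnitude rather than comparing masks directly."""
    return np.mean((est_mask * mix_mag - src_mag) ** 2)
```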