Research On Characteristics Of Speech Signal For Single Channel Speech Enhancement

Posted on: 2022-02-19 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: C Li | Full Text: PDF
GTID: 1488306350488704 | Subject: Information and Communication Engineering
Abstract/Summary:
Speech enhancement technology is essential for human-computer interaction systems to work well in complex acoustic environments. According to the number of microphones, speech enhancement can be divided into single-channel and multi-channel approaches. Compared with multi-channel technology, single-channel technology does not depend on microphone accuracy and is widely studied for its ease of use and effectiveness. Mainstream single-channel methods usually exploit the different features of speech and noise in the time domain and in transform domains. However, these methods rely heavily on accurate noise estimation, and their robustness is limited at low SNR. To address these problems, the dissertation studies the sparsity, low-rank, and harmonic characteristics of the speech signal in the frequency domain, the modulation domain, and the time domain, and designs several speech enhancement algorithms. The main contributions and results are as follows:

(1) To address the dependence of frequency-domain single-channel speech enhancement on accurate noise estimation, a single-channel speech enhancement scheme based on approximate message passing with k-nearest neighbor sparsity pattern learning is proposed. The scheme achieves enhancement by exploiting the difference in sparsity between the speech signal and the background noise in the frequency domain. First, the prior sparse distribution of speech is modeled by a Bernoulli-Gaussian distribution, and the iterative updates of the approximate message passing algorithm are used to obtain the optimal posterior probability of the enhanced speech. Then, using the correlation between adjacent coefficients in each frame, the k-nearest neighbor method is applied to improve the accuracy of the estimated sparsity pattern of each frame. As a result, the scheme still enhances speech effectively without prior conditions such as a noise estimate or complete sparsity. Experimental results show that, compared with the noise-estimation-based Wiener filter at 5 dB SNR, the perceptual evaluation of speech quality (PESQ) score of the proposed scheme improves by about 0.08.

(2) To address the poor perceptual speech quality of modulation-domain algorithms in low-SNR stationary noise, a single-channel speech enhancement scheme based on modulation improved frame-iterative spectral subtraction is proposed. The scheme exploits two properties for denoising: the speech signal is more concentrated in the modulation domain than in the frequency domain, and the difference between the distributions of speech and noise is more pronounced there. First, using the inter-frame correlation of speech, a scheme that combines double-threshold-based compressed sensing with frame-iterative spectral subtraction is designed. Second, to further suppress artifacts, a mask based on the segmental SNR is developed. As a result, the noise robustness of the system improves while the number of frames of the reconstructed speech signal is greatly reduced. Experimental results show that, compared with modulation-domain spectral subtraction, the proposed algorithm improves the average segSNR by about 1.08 dB and the average PESQ score by about 0.09 in white noise at an input SNR of -5 dB.

(3) To address the poor objective intelligibility produced by traditional time-domain single-channel speech enhancement algorithms in non-stationary noise, a single-channel speech enhancement scheme based on adaptive low-rank matrix decomposition is proposed. The algorithm exploits the difference in low-rank structure between the speech signal and the noise signal in the time domain. It therefore avoids the serious performance degradation that arises in the modulation domain and the frequency domain when the clean speech phase is replaced by the noisy speech phase. First, low-rank matrix factorization based on the maximum correntropy criterion is used to suppress multiple types of noise. Second, to overcome the limitation of a fixed rank value, an energy threshold method is designed that adaptively updates the effective rank to obtain the clean speech components. Experimental results show that, compared with the speech enhancement approach based on constrained low-rank and sparse matrix factorization, the proposed algorithm improves the average segSNR, average PESQ, and average short-time objective intelligibility (STOI) by about 0.34, 0.15, and 0.07, respectively, in babble noise at an input SNR of 0 dB.

(4) To exploit the fusion of time-domain and modulation-domain features in deep learning, a temporal convolutional network model with an auxiliary modulation-domain feature is proposed. The model exploits the harmonic characteristics of speech for denoising. First, the time-domain feature compensates for the phase information missing from the modulation-domain feature, while the modulation-domain feature fully expresses the attributes of speech, which improves the accuracy of the reconstructed speech. Second, to address model convergence and vanishing gradients, an improved network structure is designed by decomposing the gated convolutional unit and adding a dilated convolution mechanism equipped with both skip connections and residual connections. As a result, the overall performance of the speech enhancement system improves. Experimental results show that, compared with the reference speech enhancement scheme based on a temporal convolutional network, the proposed model improves the average PESQ and STOI by about 0.2 and 0.13, respectively, in babble noise at an input SNR of 0 dB.
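
The abstract does not give the rule behind the energy threshold method in contribution (3), so the following is only a minimal sketch of one common way to choose an effective rank adaptively. It keeps the smallest number of singular values whose cumulative energy reaches a chosen fraction of the total. The function names, the `energy_ratio` parameter, and the use of a plain truncated SVD in place of the maximum correntropy criterion-based factorization are all assumptions made for illustration, not the dissertation's method.

```python
import numpy as np


def effective_rank(singular_values, energy_ratio=0.9):
    """Smallest rank whose singular values hold `energy_ratio` of the energy.

    Hypothetical helper: the threshold rule is an assumption, since the
    abstract does not specify how the energy threshold is applied.
    """
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / np.sum(energy)
    # First index at which the cumulative energy reaches the threshold.
    index = int(np.searchsorted(cumulative, energy_ratio))
    # Clamp so that rounding in `cumulative` cannot exceed the full rank.
    return min(index + 1, len(energy))


def adaptive_low_rank(matrix, energy_ratio=0.9):
    """Truncated-SVD approximation with the rank chosen by the energy rule.

    A plain SVD stands in here for the maximum correntropy criterion-based
    factorization named in the abstract.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    rank = effective_rank(s, energy_ratio)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return approx, rank


if __name__ == "__main__":
    # Synthetic stand-in for a framed signal: a rank-2 matrix plus weak noise.
    rng = np.random.default_rng(0)
    clean = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 30))
    noisy = clean + 0.01 * rng.standard_normal((40, 30))
    approx, rank = adaptive_low_rank(noisy, energy_ratio=0.99)
    print("selected rank:", rank)
```

Raising `energy_ratio` keeps more components and so more residual noise, while lowering it discards weak speech components. A fixed rank cannot adapt to that trade-off from frame to frame, which is the limitation the abstract attributes to fixed-rank methods.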
Keywords/Search Tags: single-channel speech enhancement, characteristics of speech signal, approximate message passing, modulation improved frame-iterative spectral subtraction, adaptive low-rank matrix decomposition, temporal convolutional network