As a simple medium for communication, speech is used frequently in daily life. Speech signals are widely used in speech processing systems such as hearing aids, speech recognition, and portable applications. However, environmental noise in the real world degrades the quality and intelligibility of the speech signal. In a single-channel speech enhancement (SE) framework, estimating clean speech from noisy speech is therefore a challenging task, because in some cases a large part of the noise is non-stationary and may have speech-like characteristics. Consequently, there is always a need to suppress non-stationary noise. The purpose of an SE algorithm is to improve the quality and intelligibility of speech by suppressing interfering noise without significantly distorting the speech itself.

Traditional SE methods use the short-time Fourier transform (STFT), which divides the time-domain input signal into sufficiently short segments and treats the signal in each segment as stationary. This requires a window function. A narrower window gives better time resolution and better supports the stationarity assumption, but the frequency resolution is poor. Conversely, a wider window improves the frequency resolution but weakens both the stationarity assumption and the time resolution. This time-frequency resolution trade-off is the first problem of the STFT, because we cannot know precisely which frequency is present at which time instant; wavelet-based transforms, with their multi-resolution characteristics, alleviate it to an acceptable level. The second problem with traditional SE methods is that they enhance only the noisy magnitude spectrum and reconstruct the enhanced speech signal from the enhanced magnitude spectrum and the noisy phase. As a result, the denoising effect on the enhanced speech signal is limited. Our goal is to achieve a proper balance among these
problems using wavelet transforms, which decompose the time-domain signal into low-frequency and high-frequency components, corresponding to the approximation and detail coefficients, respectively.

In the first work, a novel single-channel SE method is proposed that uses the stationary wavelet transform (SWT) and non-negative matrix factorization (NMF) together with a concatenated framing process (CFP) and a subband smooth ratio mask (SSRM). It uses the SWT to overcome the shift variance of the discrete wavelet packet transform (DWPT) and then applies NMF to decompose the subbands. Before NMF, the CFP and autoregressive moving average (ARMA) filters are used to obtain a smooth decomposition and to make the speech more stable and standardized. The primary estimated signal is passed through the SSRM, which consists of the standard ratio mask (SRM), the square root ratio mask (SRRM), and the normalized cross-correlation coefficient (NCCC), to exploit the advantages of each. The algorithm's performance is evaluated using the IEEE corpus and different types of noise. With this method, objective speech quality and intelligibility improve significantly, and the method outperforms related methods such as conventional STFT-NMF and DWPT-NMF.

In the second work, a dual-tree complex wavelet transform (DTCWT) and NMF-based SE method is proposed, which utilizes the SSRM through a joint learning process. The DTCWT is used to solve the shift variance of the DWPT and the redundancy of the SWT. The method also calculates the ratio mask (RM) between the noise and the noisy speech, and simultaneously learns the RMs of the corresponding clean speech and noise training data. Before NMF, ARMA filtering is applied for smooth decomposition. An SSRM is proposed that takes advantage of the combined use of the SRM and the SRRM. Even with a small amount of training data, fewer iterations, and limited redundancy, the proposed method works well. Objective metrics from a systematic evaluation show that the proposed method improves
speech quality and intelligibility under severely noisy conditions. In addition, at low SNR it achieves better STOI and PESQ scores than the DNN-IRM scheme, because the DTCWT decomposes the input signal into a set of subband signals with high time-frequency resolution. Good time-frequency resolution means that the high-frequency components of the signal have good time resolution, while the low-frequency components retain good frequency resolution. As a result, the speech signal is adequately estimated from the noisy signal via NMF. In the case of unknown noise, the method performs significantly better than existing SE methods.

In the third work, a novel single-channel SE strategy is established that uses a double transformation, composed of the DTCWT and the STFT, together with sparse non-negative matrix factorization (SNMF). The first transform is the DTCWT, which is applied to the input signal to overcome the signal distortion caused by the downsampling in the DWPT and yields a set of coefficients. The second transform is the STFT, which is applied to each set of coefficients to generate a complex spectrogram. SNMF is then applied to each magnitude spectrogram to extract the speech components. Because the DTCWT uses filters to separate the high-frequency and low-frequency components of the time-domain signal, and the STFT can accurately capture the time-frequency content, the combination can improve the quality of the estimated speech and eliminate the distortion caused by SE processing. The method is evaluated using several metrics, including HASQI, HASPI, PESQ, STOI, fwsegSNR, and SDR. The experimental results confirm that, under noisy conditions, the overall performance of the proposed SE technique is superior to that of the STFT-SNMF, STFT-GDL, and STFT-CJSR methods. In the case of unknown noise, the proposed approach outperforms the STFT-SNMF, STFT-GDL, and STFT-CJSR methods in most cases across all SNR conditions.

In the fourth work, a dual-domain SE method is presented that jointly learns the real, imaginary, and magnitude parts of the signal using a generative
joint dictionary learning (GJDL) algorithm. In the first step, the DTCWT is applied to the time-domain signal to decompose it into a set of subband signals. The STFT is then applied to each subband signal to obtain its real part, imaginary part, and magnitude, while the phase is preserved for further processing. The GJDL approach is used to train the joint dictionary, and the batch least angle regression with a coherence criterion (LARC) algorithm is then used for sparse coding. An initial estimate is obtained, and the real and imaginary parts are combined. A subband binary ratio mask (SBRM) is applied to form the first signal, and the enhanced magnitude combined with the phase forms the second signal. Because the two signals obtained in this way differ in accuracy, they are combined using the Gini index to generate the final estimate of the clean speech signal. Compared with the available algorithms, the proposed algorithm achieves the best performance on all considered evaluation metrics.
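
The STFT window trade-off described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `stft_resolution` and the 16 kHz sampling rate are assumptions, not values taken from this work. For a window of N samples at sampling rate fs, one frame spans N/fs seconds and the DFT bins are spaced fs/N Hz apart, so improving one resolution necessarily worsens the other.

```python
def stft_resolution(window_length, sample_rate):
    """Return (frame duration in seconds, DFT bin spacing in Hz) for one STFT window."""
    time_resolution = window_length / sample_rate       # time span covered by one frame
    frequency_resolution = sample_rate / window_length  # spacing between adjacent DFT bins
    return time_resolution, frequency_resolution


SAMPLE_RATE = 16000  # assumed sampling rate in Hz (hypothetical, for illustration)

for n in (128, 512, 2048):
    dt, df = stft_resolution(n, SAMPLE_RATE)
    print(f"N={n:5d}: frame duration {dt * 1000:6.1f} ms, bin spacing {df:7.2f} Hz")
```

The product of the two quantities is always 1 for a given window, which is why a fixed-window STFT cannot offer fine time and fine frequency resolution at once.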
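
The NMF-plus-ratio-mask step shared by the first two works can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the method of this work: it uses the standard multiplicative-update rule for the Euclidean cost, it assumes the standard ratio mask is S/(S+N) and the square root ratio mask is its square root, and it uses random non-negative matrices in place of real subband magnitude spectrograms. The wavelet decomposition, CFP, ARMA filtering, and the NCCC-based combination into the SSRM are omitted, and all function names are hypothetical.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor non-negative V ~ W @ H with multiplicative updates (Euclidean cost)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def activations(V, W, n_iter=200, eps=1e-9, seed=0):
    """Estimate non-negative H with the dictionary W held fixed, so that V ~ W @ H."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H


def ratio_masks(speech_mag, noise_mag, eps=1e-9):
    """Assumed mask definitions: SRM = S / (S + N), SRRM = sqrt(SRM)."""
    srm = speech_mag / (speech_mag + noise_mag + eps)
    srrm = np.sqrt(srm)
    return srm, srrm


# Toy data standing in for magnitude spectrograms (frequency bins x frames).
rng = np.random.default_rng(1)
speech_train = rng.random((32, 100))
noise_train = rng.random((32, 100))
noisy = rng.random((32, 40)) + rng.random((32, 40))

RANK = 8  # assumed number of basis vectors per source
W_speech, _ = nmf(speech_train, RANK)   # speech dictionary from clean training data
W_noise, _ = nmf(noise_train, RANK)     # noise dictionary from noise training data

# Decompose the noisy input over the concatenated, fixed dictionaries.
W_joint = np.hstack([W_speech, W_noise])
H = activations(noisy, W_joint)
speech_est = W_speech @ H[:RANK]
noise_est = W_noise @ H[RANK:]

srm, srrm = ratio_masks(speech_est, noise_est)
enhanced = srm * noisy  # masked magnitude; the phase would come from the noisy input
```

Because both masks lie between 0 and 1 and the square root mask is never smaller than the standard one, the SRRM attenuates less aggressively; a smoothed combination of the two is one way to trade noise suppression against speech distortion.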