| With the continued spread of digital communication terminals and the rapid development of artificial intelligence, mobile and network communications have become part of people's daily lives, but the quality of speech communication is increasingly limited by ubiquitous environmental noise. To improve speech quality, speech enhancement technology has attracted extensive attention and achieved breakthroughs. Feature extraction is a key issue in speech enhancement. The effectiveness of feature parameter extraction has to be measured by analyzing the synthesized speech, which ties the feature parameters closely to feature post-processing, speech synthesis, and speech analysis. To further improve the extracted feature parameters, it is therefore necessary to study the relationship between the feature extraction method and feature post-processing, speech synthesis, and speech analysis. In view of this, this thesis studies these problems and proposes methods to improve speech quality. The analysis-by-synthesis (ABS) model first synthesizes speech from the features through a specific method and then analyzes the synthesized speech to select the desired feature parameters, thereby optimizing feature parameter extraction and improving speech quality. This thesis introduces the ABS model into single-channel speech enhancement and proposes an ABS-based single-channel speech enhancement framework to improve the extraction of feature parameters. Based on this framework, this thesis focuses on features such as linear prediction coefficients and mask features and studies the deviation between the speech synthesized from the feature parameters and actual auditory perception, so as to optimize feature extraction, improve the perceptual quality of speech, and ultimately improve speech enhancement performance. The main contributions and innovations of this thesis are 
as follows:

1. Speech enhancement framework based on the analysis-by-synthesis model

This thesis introduces the ABS model into speech enhancement for the first time to optimize the feature extraction of speech signals. To describe how the characteristics of the speech signal change during speech enhancement, this thesis defines the concepts of deep and shallow features of the speech signal. Deep features are obtained through complex processing and reflect the internal characteristics of the speech signal; shallow features are obtained through simple processing and reflect its dominant characteristics, such as the frequency and amplitude spectrum. This thesis first describes the speech feature extraction framework based on the ABS model, in which the relationship between deep and shallow features is established from a known synthesis model. Optimization equations are then established jointly from the measurements of the deep and shallow features, and the deep features are obtained by closed-loop approximation. Building on the ABS framework, this thesis introduces the deep neural network (DNN) and proposes a DNN-based ABS framework, which lays the foundation for the speech enhancement methods that follow.

2. Auto-regressive coefficient estimation based on the analysis-by-synthesis model and deep neural network-based pre-processing

In traditional speech enhancement methods based on the auto-regressive model, the residual of speech under the auto-regressive model does not always follow a Gaussian distribution, so the estimate of the target auto-regressive coefficients is inaccurate, which degrades the construction of the Wiener filter. This thesis proposes a method for estimating the auto-regressive coefficients of speech signals based on the ABS model and DNN-based pre-processing, that is, the ABS principle is used to train a DNN that pre-processes the speech signal. The 
DNN-based pre-processing makes the prediction residuals approximately follow a Gaussian distribution and improves the estimation of the auto-regressive coefficients. In the DNN training stage, the auto-regressive coefficients are used to synthesize the spectral envelope. The convergence of the network is determined by comparing the error between the synthesized spectral envelope and the original amplitude spectrum, which realizes closed-loop optimization of the DNN. Experimental results show that the proposed method effectively improves the accuracy of auto-regressive coefficient estimation.

3. Speech enhancement method based on a part-defined auto-encoder network

Existing codebook-driven speech enhancement methods cannot accurately estimate line spectrum frequency parameters in a noisy environment. As a result, the constructed Wiener filter deviates substantially from the ideal filter coefficients, which degrades the quality of the enhanced speech. Combining the DNN-based ABS framework with the codebook-driven speech enhancement method, this thesis proposes a part-defined auto-encoder network for speech enhancement. The method uses the DNN-based ABS principle to train the encoder network of the part-defined auto-encoder, thereby obtaining approximate line spectrum frequency parameters and the corresponding Wiener filter coefficients and improving the quality of the enhanced speech. In the training stage, three errors are compared to determine network convergence and achieve closed-loop optimization of the DNN: the error between the estimated and ideal Wiener filter coefficients, and the errors between the estimated and ideal line spectrum frequency parameters of speech and of noise. The experimental results show that the proposed method improves the estimation of line spectrum frequency parameters, the spectral envelope characteristics, and the quality of the enhanced speech to some 
extent.

4. Approximate estimation method of noise masking based on power exponent weighting

In phase-independent masking-based speech enhancement methods, the estimation errors of the mask and of the amplitude spectrum are often used to evaluate the convergence of DNNs, but the inherent relationship between the mask and the amplitude spectrum is rarely discussed. This thesis proposes a power exponent weighting criterion based on the amplitude spectrum of noisy speech, which is used to compare the error between the estimated mask and the ideal mask. At the same time, DNN-based mask approximation and indirect mapping of the amplitude spectrum are unified into a generalized amplitude mask based on the short-term speech spectrum (GAMSTSA). Experimental results show that the power exponent is strongly related to the quality of the enhanced speech. When the power exponent is 1, the harmonic structure and the quality of the enhanced speech are significantly improved compared with mask approximation (power exponent 0) and indirect mapping of the amplitude spectrum (power exponent 2).

5. Masking feature extraction method based on the analysis-by-synthesis model

In DNN-based mask approximation methods, the target masks are derived by maximizing the output signal-to-noise ratio (SNR) rather than the perceptual quality of the enhanced speech, so the perceptual quality of the enhanced speech is not good enough. This thesis proposes an ideal real-value ratio mask (IRVRM) extraction method based on the ABS model, in which the IRVRM is determined by maximizing the perceptual quality of the enhanced speech. The IRVRM is then used to train the DNN so as to improve the perceptual quality of the enhanced speech. The proposed ABS method consists of three processes: generation, synthesis, and analysis. In the generation process, the masking feature subspace is generated linearly based on the descent direction of the mean square error of the reconstructed 
signal. In the synthesis process, the enhanced speech is obtained by applying the inverse short-time Fourier transform (ISTFT) to the masked spectrum of the noisy speech, and in the analysis process, the IRVRM is obtained by optimization. The experimental results show that when the IRVRM extracted by the ABS process is used as the training target of the DNN, the perceptual quality of the enhanced speech in DNN-based mask approximation is effectively improved. |
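
The power exponent weighting criterion of the fourth contribution can be illustrated with a small numerical sketch. The abstract does not give the formula, so the form used here is an assumption: the squared mask error in each time-frequency bin is weighted by the noisy amplitude spectrum raised to the power p. This form is chosen only because it reproduces the two limiting cases named in the abstract, mask approximation at p = 0 and indirect mapping of the amplitude spectrum at p = 2. The function name, the toy data, and the use of NumPy are all hypothetical and are not taken from the thesis.

```python
import numpy as np


def power_weighted_mask_loss(noisy_mag, est_mask, ideal_mask, p):
    """Mean over time-frequency bins of |Y|^p * (M_est - M_ideal)^2.

    Assumed form of the criterion, not the thesis's exact definition.
    """
    weight = np.power(noisy_mag, p)
    return float(np.mean(weight * (est_mask - ideal_mask) ** 2))


# Hypothetical toy spectra: 4 frames x 8 frequency bins.
rng = np.random.default_rng(0)
noisy_mag = rng.uniform(0.1, 2.0, size=(4, 8))               # |Y|
clean_mag = noisy_mag * rng.uniform(0.0, 1.0, size=(4, 8))   # |S| <= |Y|
ideal_mask = clean_mag / noisy_mag                           # amplitude ratio mask
est_mask = np.clip(ideal_mask + rng.normal(0.0, 0.05, size=(4, 8)), 0.0, 1.0)

# Reference losses for the two limiting cases.
mask_mse = float(np.mean((est_mask - ideal_mask) ** 2))
signal_mse = float(np.mean((est_mask * noisy_mag - clean_mag) ** 2))

loss_p0 = power_weighted_mask_loss(noisy_mag, est_mask, ideal_mask, 0)
loss_p1 = power_weighted_mask_loss(noisy_mag, est_mask, ideal_mask, 1)
loss_p2 = power_weighted_mask_loss(noisy_mag, est_mask, ideal_mask, 2)

print("p=0 equals mask approximation loss:", np.isclose(loss_p0, mask_mse))
print("p=2 equals indirect mapping loss:  ", np.isclose(loss_p2, signal_mse))
print("p=1 intermediate weighting:        ", loss_p1)
```

Under this assumed form, the p = 2 case coincides with the indirect mapping loss because the clean amplitude equals the ideal mask times the noisy amplitude, so the amplitude error factors into the noisy amplitude times the mask error. Intermediate exponents such as p = 1 then trade off between treating all bins equally and emphasizing high-energy bins.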