With the development of information technology and society, speech interaction has been widely adopted in a variety of devices and application scenarios. In real-world applications, however, the speech signal is often corrupted by complex environmental noise and reverberation, which degrades speech intelligibility and also harms the performance of downstream speech-related tasks. The basic task of speech enhancement is to effectively suppress noise interference and improve the intelligibility and overall perceptual quality of speech signals. Mainstream monaural speech enhancement models based on an encoder-decoder structure use a convolutional encoder to reduce the feature dimension, and a decoder then maps the high-level semantic features back to the target output. During encoding and decoding, skip connections transfer the feature maps produced by the encoder to the decoder layers at the same depth, helping the decoder recover the denoised speech. However, existing methods do not make full use of the full-scale features generated during encoding and decoding, and full-band methods ignore the differences between local spectral patterns of speech. In addition, time-frequency-domain methods often simply reuse the phase of the noisy input when reconstructing clean speech and do not fully exploit the phase information carried by the imaginary part of the short-time Fourier transform (STFT), which limits speech enhancement performance. To address these problems, this thesis studies CRN-based speech enhancement in the time-frequency domain. The main contributions are as follows:

(1) To address the problem that popular encoder-decoder speech enhancement models do not make full use of full-scale features, a full-scale feature connected speech enhancement model, FSC-SENet, is proposed. This thesis first constructs a speech enhancement model based on the CRN architecture, in which a convolutional encoder and decoder extract features and recover the speech signal, and LSTM modules capture temporal dependencies at the bottleneck of the model. A full-scale connection method and a multi-feature dynamic fusion mechanism are then proposed so that the decoder can exploit features from all scales when recovering clean speech. Experimental results on the TIMIT corpus show that, compared with CRN, FSC-SENet improves the PESQ score by 0.39 and the STOI score by 2.8% under seen noise conditions, and by 0.43 and 3.1% under unseen noise conditions, demonstrating that the proposed full-scale connection and dynamic feature fusion mechanism give CRN better speech enhancement performance.
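As an illustration of the full-scale connection idea, the sketch below shows a minimal CRN-style encoder-decoder in which every decoder layer receives features resized from all encoder scales. The class name FullScaleCRN, the layer widths, and the concatenation-based fusion are illustrative assumptions for this sketch, not the exact FSC-SENet configuration.

```python
# Minimal sketch of a CRN-style encoder/decoder with full-scale skip
# connections. The input is assumed to be a noisy log-magnitude
# spectrogram of shape (batch, 1, frames, freq_bins).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FullScaleCRN(nn.Module):
    def __init__(self, freq_bins=160, channels=(16, 32, 64), hidden=128):
        super().__init__()
        # Convolutional encoder: each layer halves the frequency dimension.
        self.encoder = nn.ModuleList()
        in_ch = 1
        for out_ch in channels:
            self.encoder.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3),
                          stride=(1, 2), padding=(0, 1)),
                nn.BatchNorm2d(out_ch), nn.ELU()))
            in_ch = out_ch

        # LSTM bottleneck models temporal dependencies on the flattened
        # channel x frequency features (freq_bins is assumed divisible by
        # 2 ** len(channels) in this sketch).
        bott = channels[-1] * (freq_bins // 2 ** len(channels))
        self.lstm = nn.LSTM(bott, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, bott)

        # Decoder: every layer sees the previous decoder output plus the
        # feature maps of *all* encoder scales, resized to its resolution.
        self.decoder = nn.ModuleList()
        rev = list(reversed(channels))
        for i, out_ch in enumerate(rev[1:] + [1]):
            block = [nn.ConvTranspose2d(rev[i] + sum(channels), out_ch,
                                        kernel_size=(1, 3), stride=(1, 2),
                                        padding=(0, 1), output_padding=(0, 1))]
            if out_ch != 1:  # no norm/activation on the output layer
                block += [nn.BatchNorm2d(out_ch), nn.ELU()]
            self.decoder.append(nn.Sequential(*block))

    def forward(self, x):
        skips, h = [], x
        for layer in self.encoder:
            h = layer(h)
            skips.append(h)

        b, c, t, f = h.shape
        seq, _ = self.lstm(h.permute(0, 2, 1, 3).reshape(b, t, c * f))
        h = self.proj(seq).reshape(b, t, c, f).permute(0, 2, 1, 3)

        for layer in self.decoder:
            # Full-scale connection: bring every encoder feature map to the
            # current resolution and concatenate along the channel axis.
            resized = [F.interpolate(s, size=h.shape[-2:]) for s in skips]
            h = layer(torch.cat([h] + resized, dim=1))
        return h  # estimated clean log-magnitude spectrogram


net = FullScaleCRN()
noisy = torch.randn(2, 1, 100, 160)   # (batch, 1, frames, freq bins)
print(net(noisy).shape)               # torch.Size([2, 1, 100, 160])
```

Here the scales are fused by plain concatenation; the dynamic fusion mechanism of FSC-SENet would instead weight the contribution of each scale, which this sketch does not attempt to reproduce.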
(2) To address the problem that the full-band model above ignores local spectral patterns, sub-band analysis of speech is introduced, and a fullband-subband cascade model together with a simplified feature fusion module is proposed. The fusion module combines the noisy speech features with the estimate produced at the intermediate stage, helping the model in the subsequent stage to produce a better estimate. Experiments are conducted on the TIMIT corpus, and the results show that the fullband-subband cascade speech enhancement model achieves the highest objective scores among the compared models, demonstrating that the proposed two-stage speech enhancement model has better enhancement performance than the pure full-band and pure sub-band models.

(3) Considering that speech enhancement models operating in the time-frequency domain do not make full use of phase, a deep complex speech enhancement neural network is proposed on the basis of the two-stage model from the previous work. Complex-valued base modules are used to convert each functional module into its complex form, so that the network can operate directly on the complex-valued features of speech. This makes better use of the phase information, instead of the earlier approach of using only the magnitude features of speech to predict the clean magnitude spectrum. Experimental results on the TIMIT corpus show that the proposed model achieves better enhancement performance after exploiting phase information, and its evaluation scores surpass those of the other benchmark models.
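To illustrate how complex-form modules can operate on both the real and imaginary parts of the STFT, the sketch below builds a complex-valued 2-D convolution from two real convolutions. The class name ComplexConv2d, the STFT settings, and the layer sizes are illustrative assumptions, not the thesis' actual configuration.

```python
# Minimal sketch of a complex-valued 2-D convolution: a complex kernel
# (W_r + jW_i) applied to a complex feature (X_r + jX_i) gives
# real part W_r*X_r - W_i*X_i and imaginary part W_r*X_i + W_i*X_r.
import torch
import torch.nn as nn


class ComplexConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)):
        super().__init__()
        # Two real convolutions play the roles of the real and imaginary
        # parts of a single complex kernel.
        self.real = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.imag = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x_r, x_i):
        y_r = self.real(x_r) - self.imag(x_i)
        y_i = self.real(x_i) + self.imag(x_r)
        return y_r, y_i


# Usage: feed the real and imaginary parts of the noisy STFT so that phase
# information flows through the network instead of being discarded.
noisy = torch.randn(1, 16000)                       # 1 s of 16 kHz audio
spec = torch.stft(noisy, n_fft=320, hop_length=160,
                  window=torch.hann_window(320), return_complex=True)
x_r = spec.real.unsqueeze(1)                        # (batch, 1, freq, frames)
x_i = spec.imag.unsqueeze(1)
y_r, y_i = ComplexConv2d(1, 16)(x_r, x_i)
```

Stacking blocks of this kind in place of the real-valued modules of the earlier two-stage model lets phase information propagate through the whole network rather than being copied from the noisy input at reconstruction time.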