Research On Speech Enhancement Based On U-Net And Transfer Learning

Posted on: 2022-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: P K Dou
GTID: 2518306338967939
Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years, with the growing popularity of intelligent speech devices, the rising demand for speech denoising has made speech enhancement algorithms increasingly important. Deep-learning-based speech enhancement has shown great potential and has greatly improved denoising performance, but several challenges remain. First, introducing DenseNet (Dense Convolutional Network) can improve model performance, but its parameters are not fully utilized. Second, the loss function does not match the objective evaluation metrics: commonly used losses such as MSE (Mean Square Error) do not represent speech features well. Finally, in very low signal-to-noise ratio environments the model becomes less effective and less stable, and its generalization ability is limited, so the performance measured on a test dataset drops considerably in real environments.

To address these problems, we propose a time-domain speech enhancement method based on the U-Net network. First, an RDL network is introduced in each encoder layer and decoder layer to improve parameter utilization efficiency, alleviating the parameter reuse problem. Second, an attention mechanism is used to alleviate the long-term dependency problem. Third, to address the mismatch between the loss function and the evaluation metrics, a joint training loss is proposed that combines the SI-SDR (Scale-Invariant Signal-to-Distortion Ratio) loss with a time-frequency domain loss, making full use of both time-domain and frequency-domain information. Finally, a PESQ loss is used to fine-tune the model and further improve the quality of the enhanced speech. Experimental results show that, compared with the DDAEC (Dilated and Dense Auto Encoder Convolutional Neural Network) model, the ARDAEC (Attention Residual-Dense lattice Auto Encoder Convolutional Neural Network) model improves the PESQ score at 0 dB by 0.09 and the short-time objective intelligibility (STOI) score by 1.4%. The ARDAEC-P model, which is fine-tuned with the PESQ loss, raises the PESQ score to 2.76, an improvement of 1.18.

To verify the enhancement performance of the proposed model in real environments, we designed an XMOS-based microphone array board to collect noisy speech in a real environment and used it to evaluate the model's noise reduction ability. Experimental results show that, for artificially synthesized babble noise at 0 dB, the WAWEnets score of the enhanced speech was 3.45, an improvement of 2.20, and the NISQA score was 3.22, an improvement of 1.71. Compared with the DDAEC model, the proposed model achieved better WAWEnets and NISQA scores for the vast majority of noise types, demonstrating good speech denoising capability.

To further improve the generalization ability of the proposed model, a dynamic domain-adversarial adaptation network is used to learn the invariance between the source domain and the target domain. Without parallel speech, and using only a small part of the test speech, the performance of the speech enhancement method improves significantly on the target domain. Experimental results show that, compared with the ARDAEC-P model, the SEDDAN-250 model, which was trained on 30% of the test set data, improves the PESQ score by 0.08 and the STOI score by 2.5% at 0 dB.
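The joint training loss mentioned in the abstract combines an SI-SDR term with a time-frequency domain term. The abstract does not give its exact form, so the following is only a minimal NumPy sketch under stated assumptions: the time-frequency term is taken to be an L1 distance between STFT magnitudes, and the weight `alpha`, the frame length, the hop size, and all function names are hypothetical choices made for illustration, not values from the thesis. The PESQ fine-tuning stage is not shown.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    # Remove the mean so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; the residual is the distortion.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    distortion = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    return 10.0 * np.log10(ratio)


def stft_magnitude(signal, frame_len=512, hop=128):
    """Magnitude spectrogram from Hann-windowed frames (assumed settings)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def joint_loss(estimate, target, alpha=0.5):
    """Negative SI-SDR plus a weighted L1 spectral-magnitude distance.

    Both the L1 form and alpha are assumptions, not the thesis's definition.
    """
    tf_loss = np.mean(np.abs(stft_magnitude(estimate) - stft_magnitude(target)))
    return -si_sdr(estimate, target) + alpha * tf_loss
```

Because SI-SDR is maximized while the spectral distance is minimized, the sketch negates the SI-SDR term so that a lower joint loss corresponds to a better estimate. In actual training these operations would be written in an automatic-differentiation framework rather than plain NumPy.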
Keywords/Search Tags:speech enhancement, transfer learning, U-Net network, attention mechanism