Deep Learning Speech Enhancement Technology Considering Time-frequency Features

Posted on: 2024-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: D H Zhang
GTID: 2568307100962339
Subject: Computer technology
Abstract/Summary:
The development of artificial intelligence has driven the large-scale application of intelligent speech processing technologies, represented by speech recognition, and speech has become an important mode of human-computer interaction. In practical applications, speech signals are inevitably corrupted by noise, which can significantly reduce the accuracy of speech recognition algorithms and impair the transmission of meaning. Using speech enhancement to recover clean speech from noisy speech is therefore an important way to safeguard the performance of speech recognition algorithms and to improve intelligibility for human listeners. Traditional speech enhancement algorithms, represented by the Wiener filter, the Kalman filter and signal subspace methods, have difficulty suppressing non-stationary noise effectively. A series of deep learning speech enhancement methods that have emerged in recent years, represented by convolutional neural networks (CNN), recurrent neural networks (RNN) and generative adversarial networks (GAN), can effectively mitigate this defect. However, these classical deep learning methods do not make full use of the temporal dependence, context information and other characteristics of speech signals, and they also suffer from complex network structures and large numbers of parameters. To address these problems, this thesis explores GAN-based speech enhancement and proposes several improved GAN speech enhancement models that take into account the different characteristics of speech in the time domain and the time-frequency domain. The main work is as follows:

(1) A generative adversarial speech enhancement method that integrates the gated recurrent unit and the self-attention mechanism is designed and implemented. To address the problem that current generative adversarial networks (GAN) do not make full use of the temporal correlation, global correlation and other properties of speech feature sequences, the Gated Recurrent Unit (GRU) and self-attention are combined, in series and in parallel, to construct a temporal modeling module that captures the temporal correlation and context information of speech feature sequences within the generative adversarial network. Compared with the baseline algorithm on the same dataset, the enhanced speech generated by the GAN that fuses GRU and self-attention achieves a 4% improvement in the Perceptual Evaluation of Speech Quality (PESQ) score and performs better on several other objective evaluation measures. The experimental results show that the performance of a speech enhancement GAN can be further improved by paying more attention to the temporal features of time-domain speech.

(2) A generative adversarial speech enhancement network based on the dual-path Transformer is designed and implemented. Most existing GAN speech enhancement methods either operate directly in the time domain or enhance speech using the magnitude spectrum in the time-frequency domain, and these methods do not optimize speech phase directly. To optimize speech phase, a GAN speech enhancement architecture based on the Gated Linear Unit (GLU) and the Dual-Path Transformer (DPT) is proposed. The model processes both the magnitude and the phase information of the speech signal in the time-frequency domain. Within the model, GLU and DPT help the network better extract the temporal and global features of speech. The generator follows an autoencoder design, with a dual-decoder structure that maps the real part and the imaginary part of the spectrogram separately. Experimental results show that, by considering both the phase and the magnitude information of speech, the proposed GAN model achieves excellent noise reduction on the Voice Bank+DEMAND dataset. On the same test set, the objective speech intelligibility and quality of the proposed model are better than those of most existing speech enhancement GANs.

(3) A low-parameter dual-path Transformer speech enhancement GAN is designed and implemented. Current GAN-based speech enhancement methods suffer from complex models and too many parameters. To solve this problem, a low-parameter dual-path Transformer speech enhancement GAN is proposed. The encoder and decoder of the network use dilated convolution to enlarge the convolutional receptive field and dense connections to pass information across layers, which greatly reduces the number of model parameters. A Dual-Path Transformer (DPT) module and a mode masking module are placed between the encoder and the decoder to extract multidimensional features of the speech signal and to better control the transmission of effective feature information to the decoding layers. Experimental results show that the proposed network maintains excellent performance on a range of speech evaluation metrics despite its low parameter count.
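
As a rough illustration of why dilated (also called hollow or atrous) convolution, as used in item (3), enlarges the receptive field without adding parameters, the sketch below compares a stack of ordinary 1-D convolutions with a dilated stack. The kernel size of 3, the dilation rates 1, 2, 4, 8 and the 64 channels are assumptions chosen for illustration and are not taken from the thesis. The function names are hypothetical, and the sketch does not model the dense connections that the thesis also uses.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in input samples, of stacked stride-1 1-D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def conv_params(kernel_size, channels, num_layers):
    """Weights plus biases for num_layers conv layers with equal in/out channels.

    The dilation rate does not appear here: it changes the spacing of the
    kernel taps, not the number of taps.
    """
    per_layer = kernel_size * channels * channels + channels
    return per_layer * num_layers


# Assumed illustrative settings (not from the thesis).
KERNEL_SIZE = 3
CHANNELS = 64

plain_rf = receptive_field(KERNEL_SIZE, [1, 1, 1, 1])    # ordinary convolutions
dilated_rf = receptive_field(KERNEL_SIZE, [1, 2, 4, 8])  # dilated convolutions
params_4_layers = conv_params(KERNEL_SIZE, CHANNELS, 4)

# Number of ordinary layers needed to match the dilated receptive field.
plain_layers_needed = (dilated_rf - 1) // (KERNEL_SIZE - 1)
params_plain_matched = conv_params(KERNEL_SIZE, CHANNELS, plain_layers_needed)

print("receptive field, 4 plain layers:  ", plain_rf)
print("receptive field, 4 dilated layers:", dilated_rf)
print("parameters, 4 layers (either kind):", params_4_layers)
print("plain layers needed to match:      ", plain_layers_needed)
print("parameters for that plain stack:   ", params_plain_matched)
```

Under these assumed settings, the four dilated layers cover 31 input samples against 9 for four ordinary layers, with the same parameter count. An ordinary stack would need 15 layers to reach the same coverage.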
Keywords/Search Tags: Speech enhancement, Deep learning, Generative adversarial network, Feature extraction, Self-attention