Speech is one of the most important means by which people communicate and convey information. However, ubiquitous background noise in the real world often interferes with the acquisition of clean speech. In human-human and human-computer interaction scenarios, noise blurs the characteristics of speech signals and reduces their objective quality and intelligibility, thereby degrading the perceptual experience and the accuracy of speech processing systems. To make speech communication smoother, speech enhancement technology has attracted considerable research interest. Relying on filtering theory and digital signal processing techniques, traditional unsupervised speech enhancement algorithms have achieved good results, but they struggle to cope with non-stationary noise. In recent years, supervised speech enhancement algorithms have attracted increasing attention because they can learn the characteristics of speech and noise. The sparse representation based method analyzes the spectral structure of signals using dictionary learning theory, providing a more flexible noise suppression scheme. A dictionary can be viewed as a set of more essential features that reduces the dimensionality of a signal. During the sparse coding of noisy speech, the speech and noise components are projected onto their respective dictionary spaces, which promotes the separation of clean speech. Building on conventional methods, this thesis exploits the dependency between local groups of a signal to characterize its spectral structure more accurately, which allows a more discriminative dictionary to be trained. In addition, the sparse representation method, known for its spectral modeling ability, is combined with deep learning as an exploratory study. The main work of this thesis is as follows.

First, we propose a speech enhancement algorithm using group sparse representation in the modulation domain. Unlike conventional algorithms, the proposed algorithm enhances the subband components of the speech signal in the modulation domain and additionally takes the group structure of the signal into account. In the preprocessing stage, the modulation transform is applied to generate the modulation amplitude spectrum used for sparse representation. In the dictionary learning stage, the frames of the training signal are clustered into a series of groups according to correlation distance, and a sub-dictionary is trained separately for each group. The sub-dictionaries are then concatenated into complete structured dictionaries. In the enhancement stage, a group sparsity term is added to the objective function so that the structured dictionary is activated block by block, which yields coefficients with group sparsity. By reducing cross-group activation, the proposed method recovers clean speech from the noisy signal according to predetermined group patterns.

Second, we propose a speech enhancement algorithm that combines sparse representation with an encoder-decoder network. On top of the mapping from noisy speech to clean speech learned by the neural network, a sparse representation feature is introduced to provide more discriminative spectral structure information. Two encoders extract high-dimensional features from the noisy speech and from the sparse representation feature respectively, and the two features are then fused to predict the training target of the network. To jointly optimize feature extraction and speech enhancement, we use a neural network that simulates the sparse non-negative matrix factorization algorithm to generate the principal component activation patterns of the signal. The network is trained with different time-domain and frequency-domain loss functions to study their impact on performance.
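
The idea of activating a structured dictionary group by group can be illustrated with a small sketch. It is not the algorithm of this thesis: the abstract does not state the exact objective, so the sketch assumes a standard group-lasso cost, 0.5·||x − Dh||² + λ·Σ_g ||h_g||₂, solved by proximal gradient descent with block soft-thresholding. The random sub-dictionaries, the function names `group_soft_threshold` and `group_sparse_code`, and all parameter values are hypothetical and stand in for dictionaries trained on clustered modulation-domain frames.

```python
import numpy as np


def group_soft_threshold(h, groups, tau):
    """Shrink each coefficient block towards zero by tau in Euclidean norm."""
    out = np.zeros_like(h)
    for idx in groups:
        norm = np.linalg.norm(h[idx])
        if norm > tau:
            out[idx] = (1.0 - tau / norm) * h[idx]
    return out


def group_sparse_code(x, D, groups, lam=0.05, n_iter=200):
    """Minimise 0.5*||x - D h||^2 + lam * sum_g ||h_g||_2 (assumed objective)."""
    # Step size 1/L, where L is the squared spectral norm of D.
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    h = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ h - x)
        h = group_soft_threshold(h - step * grad, groups, step * lam)
    return h


rng = np.random.default_rng(0)
n_features, atoms_per_group, n_groups = 40, 5, 4

# Hypothetical sub-dictionaries, one per group, concatenated into a
# structured dictionary with unit-norm atoms.
sub_dicts = [rng.standard_normal((n_features, atoms_per_group))
             for _ in range(n_groups)]
D = np.hstack(sub_dicts)
D /= np.linalg.norm(D, axis=0, keepdims=True)
groups = [np.arange(g * atoms_per_group, (g + 1) * atoms_per_group)
          for g in range(n_groups)]

# Synthetic frame built from the atoms of a single group (index 2).
h_true = np.zeros(D.shape[1])
h_true[groups[2]] = rng.standard_normal(atoms_per_group)
x = D @ h_true

h = group_sparse_code(x, D, groups)

# Energy per group; the group used to synthesise x is expected to dominate.
group_energy = [float(np.linalg.norm(h[idx])) for idx in groups]
print(group_energy)
```

Because the penalty acts on the norm of each block rather than on individual coefficients, whole sub-dictionaries tend to be switched on or off together, which is the behaviour the group sparsity term is meant to encourage.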