
Research On Speech Separation Algorithm Based On Self-attention Mechanism And Speaker Embedding

Posted on: 2023-06-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Zhang
GTID: 2558307118496324
Subject: Computer Science and Technology
Abstract/Summary:
The cocktail party effect refers to the ability of a human listener to focus on one conversation and ignore irrelevant sounds in a noisy environment, and reproducing it is a difficult problem in speech signal processing. Although automatic speech separation has been studied intensively, it remains an active topic in speech processing. Single-channel speech separation needs only one microphone and is cheap to deploy, so it has a wide range of application scenarios. In recent years, single-channel methods based on the self-attention mechanism have achieved the best results in this field. However, their high complexity and low efficiency make them difficult to deploy on smart devices. This thesis therefore focuses on improving the efficiency of speech separation models based on the self-attention mechanism, and on further improving their performance. The main content is as follows:

1) To address the high complexity of current single-channel speech separation models based on the self-attention mechanism, an improved fast self-attention mechanism is proposed that raises efficiency while preserving as much of the model's performance as possible. First, a local-window self-attention mechanism is designed to reduce the computational complexity of self-attention. At the same time, deep dilated convolutions are introduced into the position-wise feed-forward network to exchange information between local windows. Furthermore, because the self-attention mechanism lacks fine-grained information between adjacent elements, an adjacent detail branch is proposed to extract fine-grained information between adjacent speech units. Compared with the original speech separation model based on the self-attention mechanism, the model based on the fast self-attention mechanism has about 48.87% fewer parameters and about 53.74% less computation, and its real-time factor is increased by about 27.99%, while separation performance drops by only 0.98% on average. The proposed model therefore greatly reduces complexity at the cost of only a small performance loss.

2) To improve the performance of speech separation based on the fast self-attention mechanism, a conditional layer normalization technique for information fusion is designed to introduce the identity of the target speaker into the separation network, yielding a target-speaker speech separation model based on the fast self-attention mechanism. An end-to-end speaker recognition module extracts the speaker embedding vector, and the embedding is then passed through the conditional layer normalization layer to constrain the distribution of parameters in the network, so that the network can better separate the target speaker's speech from the mixed speech. Compared with the original speech separation model based on the self-attention mechanism, the experimental results show that the target-speaker model reduces the number of parameters by about 45.4% and the amount of computation by 53.4%, and increases the real-time factor by 24.2%. Separation performance improves by an average of about 4% in the clean environment and about 8% in the noisy environment, which shows that the proposed model is more robust in noisy conditions.
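The two mechanisms named in the abstract, local-window self-attention and conditional layer normalization, can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes non-overlapping windows, uses the input frames directly as queries, keys and values (no learned projections), and assumes the speaker embedding is mapped to a per-feature scale and shift by two linear layers. All function and parameter names (`local_window_attention`, `conditional_layer_norm`, `w_gamma`, `b_gamma`, `w_beta`, `b_beta`) are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_window_attention(frames: Matrix, window: int) -> Matrix:
    """Scaled dot-product attention restricted to non-overlapping windows.

    Assumption: queries, keys and values are the frames themselves.
    Each frame attends only to the frames inside its own window.
    """
    dim = len(frames[0])
    out: Matrix = []
    for start in range(0, len(frames), window):
        block = frames[start:start + window]
        for query in block:
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in block
            ]
            weights = softmax(scores)
            out.append([
                sum(w * value[j] for w, value in zip(weights, block))
                for j in range(dim)
            ])
    return out


def layer_norm(frame: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one frame to zero mean and unit variance over its features."""
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    return [(v - mean) / math.sqrt(var + eps) for v in frame]


def linear(weights: Matrix, bias: Vector, vec: Vector) -> Vector:
    """Compute weights @ vec + bias, with one weight row per output feature."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def conditional_layer_norm(frames: Matrix, speaker_embedding: Vector,
                           w_gamma: Matrix, b_gamma: Vector,
                           w_beta: Matrix, b_beta: Vector) -> Matrix:
    """Layer normalization whose scale and shift depend on the speaker.

    Assumption: gamma and beta are linear functions of the embedding.
    """
    gamma = linear(w_gamma, b_gamma, speaker_embedding)
    beta = linear(w_beta, b_beta, speaker_embedding)
    return [
        [g * h + b for g, h, b in zip(gamma, layer_norm(frame), beta)]
        for frame in frames
    ]


if __name__ == "__main__":
    # Four frames with three features each, and a 2-dimensional embedding.
    frames = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0],
              [4.0, 0.0, -1.0], [1.5, 1.5, 0.0]]
    attended = local_window_attention(frames, window=2)
    embedding = [0.5, -0.5]
    w = [[0.1, 0.2], [0.0, 0.3], [-0.2, 0.1]]
    fused = conditional_layer_norm(attended, embedding,
                                   w, [1.0, 1.0, 1.0], w, [0.0, 0.0, 0.0])
    print(fused)
```

Under these assumptions, with T frames and a window size w that divides T, the sketch computes T·w attention scores rather than the T² of full self-attention, which is the kind of saving the local-window design targets. The windowing scheme, projections and fusion layers actually used in the thesis may differ.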
Keywords/Search Tags: Speech Separation, Fast Self-Attention Mechanism, Conditional Layer Normalization, Speaker Embedding