
Research On Voice Activity Detection Technology Based On Attention Mechanism

Posted on: 2023-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: S Li
Full Text: PDF
GTID: 2568306782963729
Subject: Computer application technology

Abstract/Summary:
Voice activity detection (VAD) distinguishes the speech and non-speech segments of an audio signal in noisy environments and determines their starting and ending points. However, speech signals are easily corrupted by high-intensity noise in low signal-to-noise ratio (SNR) environments, so the detection accuracy of VAD degrades significantly. To improve the performance of VAD at low SNRs and its robustness to non-stationary noise, this thesis first introduces the deep learning framework for VAD, covering preprocessing, acoustic features, and popular network architectures, and then explores the application of attention mechanisms in recurrent neural networks and residual convolutional networks. The main work is as follows:

(1) Firstly, we analyze how to generate contextual sequences of speech frames as extended context information. Based on self-attention, we propose a model that fuses local and global attention: the self-attention operation on the LSTM unit is realized by an attention-enhanced long short-term memory network (AELSTM), which strengthens the contextual modeling capacity of the recurrent network by tracing back to historical cell states and extracting valuable features; the global attention learns an attention distribution over the contextual sequence, driving the model to focus on the frames that help classification while suppressing unnecessary noise. Experimental results show that the area under the curve (AUC) at -10 dB SNR exceeds state-of-the-art attention-based VAD methods by 3.2%, reaches 86.89% at the low SNR of -15 dB, and also shows good adaptability under non-stationary noise.

(2) Then, to address the limitation of generating contextual sequences with a fixed range and step, and inspired by maximum inner product search (MIPS), we propose a self-attention-inspired locality-sensitive hashing algorithm that performs dynamic, short-time, and efficient contextual frame search.

(3) Finally, the attention mechanisms currently used in VAD do not consider the temporal-frequency characteristics of speech. To address this problem, we propose a two-dimensional residual frequency-temporal attention network based on the residual convolutional neural network and model the one-dimensional frame-level spectrum. Considering the characteristics of speech signals, we design a convolutional attention strategy that models channel, spectrum, and temporal information in succession, and an interval branch is set in each channel attention to promote and guide the feature learning of the temporal and frequency attention. The AUC is better than that of method (1) and the residual networks used in VAD by 1.27% and 1.89%, respectively, and high stability is maintained at -15 dB SNR with an AUC of 92.80%.
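To make contribution (1) concrete, the following is a minimal sketch, not the thesis implementation, of an LSTM run over a contextual window of frames followed by a global attention pool that weights the frames most useful for the speech/non-speech decision. The class name GlobalAttentionVAD, the layer sizes, and the two-class output are illustrative assumptions.

import torch
import torch.nn as nn

class GlobalAttentionVAD(nn.Module):
    # Sketch: LSTM over a contextual window of frames, then a global attention
    # pool over the window; all dimensions are assumptions, not thesis settings.
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)        # attention energy per frame
        self.classifier = nn.Linear(hidden, 2)   # speech / non-speech logits

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)                      # (batch, frames, hidden)
        w = torch.softmax(self.score(h), dim=1)  # attention weights over the window
        context = (w * h).sum(dim=1)             # weighted summary of the window
        return self.classifier(context)          # decision for the centre frame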
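For contribution (2), the sketch below illustrates one common form of locality-sensitive hashing (random-hyperplane hashing) used to retrieve a dynamic set of context frames for a query frame. The function name, bit width, and context size are assumptions for illustration, not the algorithm proposed in the thesis.

import numpy as np

def lsh_context(frames, query_idx, n_bits=8, max_context=16, seed=0):
    # frames: (n_frames, feat_dim) array of frame features.
    # Hash every frame with random hyperplanes; frames whose sign pattern
    # matches the query frame's pattern are returned as its dynamic context.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, frames.shape[1]))
    codes = (frames @ planes.T) > 0                        # (n_frames, n_bits) sign bits
    matches = np.where((codes == codes[query_idx]).all(axis=1))[0]
    return matches[:max_context]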
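For contribution (3), a compact sketch of applying channel, frequency, and temporal attention in succession to a spectrogram feature map is given below. The pooling scheme, kernel sizes, and the omission of the interval branch are assumptions; this is not the thesis design.

import torch
import torch.nn as nn

class FreqTemporalAttention(nn.Module):
    # Sketch: successive channel, frequency, and temporal attention on a
    # feature map of shape (batch, channels, freq, time).
    def __init__(self, channels):
        super().__init__()
        self.channel_fc = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.freq_conv = nn.Conv2d(1, 1, kernel_size=(7, 1), padding=(3, 0))
        self.time_conv = nn.Conv2d(1, 1, kernel_size=(1, 7), padding=(0, 3))

    def forward(self, x):                                  # x: (B, C, F, T)
        w_c = self.channel_fc(x.mean(dim=(2, 3)))          # channel weights from global pooling
        x = x * w_c[:, :, None, None]
        m = x.mean(dim=1, keepdim=True)                    # channel-averaged map (B, 1, F, T)
        x = x * torch.sigmoid(self.freq_conv(m))           # frequency attention
        m = x.mean(dim=1, keepdim=True)
        x = x * torch.sigmoid(self.time_conv(m))           # temporal attention
        return x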
Keywords/Search Tags:Voice Activity Detection, Attention Mechanism, Long Short-Term Memory Network, Locality Sensitive Hashing, Residual Convolutional Network