| Driven by modern lifestyles, voice-controlled products have freed users' hands, and voice wake-up, as the entry point to voice control, has become all the more important. At present, many studies on voice wake-up are limited to quiet environments and ignore the many kinds of environmental noise that appear in practical applications. In addition, voice wake-up models are generally deployed on resource-constrained mobile or embedded devices, so they need to be lightweight and computationally efficient. This thesis therefore aims to improve the performance of the voice wake-up model in noisy scenarios while keeping it suitable for deployment on an embedded platform. First, this thesis proposes a high-accuracy voice wake-up model. Based on an analysis of how the frequency-domain features are distributed along the time and frequency dimensions, the input features are processed with sub-spectral normalization and frequency-domain data augmentation. We propose a time-frequency feature extraction module based on one-dimensional dilated convolution, combined with a gated activation function, to extract features from speech data efficiently. The channel attention mechanism is improved so that the output of this module is fused with the backbone network in relative proportion. To exploit the time-series nature of speech data, a temporal attention module based on the multi-head attention mechanism is added at the end of the network. The final model reaches an accuracy of 97.45% with 171.8K parameters. Its accuracy is only 0.95% below that of the current advanced MHAttRNN model, while it has only one-fifth as many parameters. Second, to improve the model's noise robustness for real-world application scenarios, this thesis uses the MUSAN dataset, which is commonly used in speech enhancement, to synthesize noisy data with signal-to-noise ratios ranging from -5 dB to 15 dB from the Google Speech Commands dataset, addressing the lack of noisy scenarios in wake-up word datasets. A modified spectral subtraction method is proposed and used as a noise reduction front end for the model. On the model side, inspired by ideas from speech enhancement, the mean squared error is used to improve the loss function, and incremental training is performed on the noisy data. On the noisy dataset, the model achieves an accuracy of 96.72%, which is 6.7% higher than that of the model trained solely on clean speech. Third, to meet the requirement of ultimately deploying the model on an embedded system, this thesis proposes a further lightweight design for the noise-robust voice wake-up model. Depthwise separable convolution replaces standard convolution in the time-frequency feature extraction module, and all parameters of the channel attention module are shared, which reduces the parameter count by 74% to 44.62K. Compared with other lightweight models of the same size, the model has an accuracy advantage of 1% to 5% in noisy scenarios. For model compression, quantization-aware training and quantized inference are applied; the test accuracy remains at 95.42% in the noisy scenario, and detecting a 1 s wake-up word takes 13.45 ms on average... |
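
The noisy-data synthesis described in the abstract (mixing noise into clean commands at signal-to-noise ratios from -5 dB to 15 dB) can be illustrated with a minimal NumPy sketch. This is not the thesis's own pipeline: the function names `mix_at_snr` and `measured_snr_db`, the 5 dB step between SNR levels, the 16 kHz sample rate, and the use of random arrays in place of real Google Speech Commands and MUSAN audio are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR (dB).

    Hypothetical helper: the noise is tiled or cropped to the clean length,
    then scaled with a gain g chosen so that
    10 * log10(P_clean / (g^2 * P_noise)) == snr_db.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise if it is shorter than the clean signal, then crop.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise segment is silent; cannot set an SNR")

    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


def measured_snr_db(clean, mixed):
    """Compute the SNR (dB) of a mixture, given the clean reference."""
    clean = np.asarray(clean, dtype=np.float64)
    residual = np.asarray(mixed, dtype=np.float64) - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)   # stand-in for a 1 s command at 16 kHz
    noise = rng.standard_normal(5000)    # stand-in for a shorter noise clip

    # The -5 dB to 15 dB range comes from the abstract; the 5 dB step is assumed.
    for snr_db in range(-5, 20, 5):
        mixed = mix_at_snr(clean, noise, snr_db)
        print(snr_db, round(measured_snr_db(clean, mixed), 3))
```

Scaling the noise rather than the speech keeps the level of the wake-up word unchanged across SNR conditions, which is one common convention; scaling the speech instead would be an equally valid choice.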