| With advances in science and technology, the training data and hardware needed for deep learning to take off are now in place. In recent years, with the rise of user-generated content platforms, the volume of data on the Internet has exploded; in line with Moore’s Law, hardware computing power has grown exponentially. The basic theories of artificial intelligence, represented by neural networks, are being updated and iterated rapidly. Long-standing problems in many fields have therefore gained new methods, moving from traditional solutions to deep learning solutions. Voice activity detection (VAD) is a preprocessing step for many speech tasks, such as speech recognition, speech enhancement, and speech coding. It classifies each audio frame as a speech frame or a non-speech frame: speech frames contain speech content, and non-speech frames do not. The task usually calls for a small model, fast inference, and high accuracy. Traditional hand-designed voice activity detection schemes often lack robustness. This dissertation combines the advantages of deep neural networks, long short-term memory networks, and convolutional neural networks to construct CLDVAD, a network for voice activity detection, with the CLDNN network as the baseline model. Within CLDVAD, the convolutional neural network extracts and fuses the abstract features of several consecutive frames and offers useful properties such as translation invariance and flip invariance; the long short-term memory network models the temporal sequence of the input speech; and the deep neural network maps the features into a higher-dimensional, more separable space so that the model can classify them more easily. This dissertation explores the capability of the voice activity detection model by varying the network structure, the input features, and the size of the model's look-ahead (the number of future frames it sees), so that a smaller model achieves faster detection and comparatively better accuracy. The main contributions of this dissertation are summarized as follows: (1) It compares and analyzes the influence of different input features on the voice activity detection model, including Filter Bank 40, Filter Bank 75, MFCC, and Raw Wave, and selects the Filter Bank 40 feature as the input of the model with the best overall performance. (2) The proposed CLDVAD model needs only a small look-ahead, which greatly reduces the time the model waits for future frames and speeds up prediction. The latency of the model with the best overall performance is 10 ms. (3) Compared with the CLDNN baseline model, the proposed CLDVAD model has fewer parameters and requires less computation. (4) The accuracy of the proposed CLDVAD model on the voice activity detection task is greatly improved over the CLDNN baseline model. In summary, the research topic of this dissertation is voice activity detection, and the method used is a deep learning algorithm. Detection efficiency, detection accuracy, and algorithm complexity are all improved to some extent over the baseline CLDNN model, so the model could be used in realistic scenarios. |
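
The link between the number of future frames a model sees and the time it must wait before scoring a frame, as described in contribution (2), can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names `splice_frames` and `lookahead_latency_ms`, the edge-padding rule, the toy feature values, and the 10 ms frame shift are all assumptions chosen for illustration. The abstract states only the resulting 10 ms latency, not the frame shift or the number of look-ahead frames.

```python
from typing import List, Sequence


def splice_frames(
    frames: Sequence[Sequence[float]], left: int, right: int
) -> List[List[float]]:
    """Concatenate each frame with `left` past and `right` future frames.

    Edges are padded by repeating the first or last frame (an assumption;
    zero padding would work equally well for this illustration).
    """
    if left < 0 or right < 0:
        raise ValueError("context sizes must be non-negative")
    if not frames:
        return []
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window: List[float] = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), last)  # clamp index to a valid frame
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


def lookahead_latency_ms(right: int, frame_shift_ms: float = 10.0) -> float:
    """Time spent waiting for future frames before frame t can be scored.

    The 10 ms default frame shift is an assumed, commonly used value.
    """
    return right * frame_shift_ms


# Toy 2-dimensional "features" for four frames (made-up numbers).
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
spliced = splice_frames(frames, left=2, right=1)
print(len(spliced), len(spliced[0]))  # 4 frames, each (2 + 1 + 1) * 2 = 8 values
print(lookahead_latency_ms(right=1))  # 10.0 under the assumed 10 ms shift
```

Under these assumptions, every extra future frame adds one frame shift of waiting, which is why a smaller look-ahead translates directly into lower latency, while past context adds no waiting at all.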