Because cloud video conferencing can span time and space, it has been applied to social practice more and more widely in recent years. However, during transmission between the cloud video conference terminal and the server, the signal is easily contaminated by noise and interference. Since the sound signal is an important carrier of information in cloud video conferences, its quality and clarity suffer greatly, and this loss seriously affects artificial intelligence applications of cloud video conferencing such as smart subtitles and smart minutes. A real-time denoising algorithm that serves back-end voice applications well is therefore crucial to the further development of cloud video conferencing. Such a denoising algorithm is also called speech enhancement. Speech enhancement generally refers to removing, by certain technical means, the noise contained in a noisy speech signal so that the signal becomes closer to the clean speech signal. Traditional speech enhancement algorithms generally make assumptions about the composition of the noisy speech and then remove noise on the basis of these mathematical assumptions. Such assumption-based algorithms cannot effectively handle the non-stationary noise mixed into the sound signal, may even introduce new noise that degrades back-end speech recognition, and can no longer meet the requirements of contemporary cloud video conferencing. Since the well-known multilayer perceptron was proposed, deep learning has developed rapidly and has been applied to many fields. Deep learning models can capture deep features of the sound signal and learn the characteristic properties of noise-free speech. Compared with traditional speech enhancement algorithms, deep learning algorithms do not need to make assumptions about the noisy speech and are better suited to today's environments with complex noise. Based on these advantages, this thesis presents research on the DSAGAN speech enhancement algorithm to meet the needs of cloud video conferencing.

The main research work of this thesis is as follows:

(1) This thesis designs a speech enhancement model for cloud video conferencing based on a generative adversarial network. The generator uses an encoder module to extract features from the noisy speech and a decoder module to estimate the clean speech signal from the encoded features. Adding a dense connection module, an attention module, and related structures to the generator effectively improves the PESQ and STOI values of the estimated speech (a hedged sketch of such a generator appears below, after item (3)).

(2) This thesis studies how the speech enhancement network can better serve back-end voice applications. Referring to the principle of the DNN-HMM algorithm, a discriminator with a phoneme loss is designed for the generative adversarial network, so that the enhanced speech retains more linguistic information and the WER drops significantly.

(3) Building on the self-attention mechanism, which has achieved excellent results in many fields, and further considering real-time computation, a sparse attention structure that saves computing resources and time is designed. During the attention computation, the weight vectors are grouped by a clustering algorithm, which greatly reduces the computational complexity of attention, as sketched below.
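To make contribution (3) concrete, the following is a minimal sketch of cluster-based sparse attention written in PyTorch. The grouping step (a plain k-means over the query vectors), the number of clusters, and the tensor shapes are illustrative assumptions; the sketch only shows the general idea of cutting the quadratic attention cost down to one score row per cluster, not the exact structure used in DSAGAN.

```python
# Illustrative sketch (not the thesis implementation): queries are grouped by
# k-means, attention is evaluated once per cluster centroid, and every query
# reuses the output of its cluster, so n*n query-key products become k*n.
import torch


def clustered_attention(q, k, v, num_clusters=8, kmeans_iters=5):
    """q, k, v: tensors of shape (seq_len, dim); returns (seq_len, dim)."""
    n, d = q.shape

    # simple k-means over the query vectors (unoptimized, for illustration only)
    centroids = q[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(kmeans_iters):
        assign = torch.cdist(q, centroids).argmin(dim=1)   # (n,) cluster index per query
        for c in range(num_clusters):
            members = q[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    assign = torch.cdist(q, centroids).argmin(dim=1)        # final assignment

    # attention scores are computed only for the cluster centroids
    scores = centroids @ k.t() / d ** 0.5                   # (num_clusters, n)
    ctx = torch.softmax(scores, dim=-1) @ v                 # (num_clusters, dim)

    # each query reuses the attention output of its cluster
    return ctx[assign]


if __name__ == "__main__":
    q, k, v = (torch.randn(128, 64) for _ in range(3))
    print(clustered_attention(q, k, v).shape)               # torch.Size([128, 64])
```

With n query positions and k clusters, the score matrix shrinks from n by n to k by n, which is where the claimed saving in computing resources and time would come from.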
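For contribution (1), the sketch below shows one plausible shape of an encoder-decoder generator with an attention gate on the bottleneck and a skip connection back to the encoder. The layer sizes, kernel widths, and the choice of a sigmoid gate are assumptions made for illustration and do not reproduce the DSAGAN generator.

```python
# A hedged sketch of an encoder-decoder GAN generator for waveform enhancement.
# Channel counts, kernel sizes, the attention gate, and the skip connection are
# illustrative assumptions, not the exact structure described in the thesis.
import torch
import torch.nn as nn


class Generator(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # encoder: compress the noisy waveform into a feature code
        self.enc1 = nn.Conv1d(1, channels, kernel_size=32, stride=2, padding=15)
        self.enc2 = nn.Conv1d(channels, channels * 2, kernel_size=32, stride=2, padding=15)
        # attention gate applied to the bottleneck features
        self.attn = nn.Sequential(nn.Conv1d(channels * 2, channels * 2, 1), nn.Sigmoid())
        # decoder: estimate the clean waveform from the feature code
        self.dec1 = nn.ConvTranspose1d(channels * 2, channels, kernel_size=32, stride=2, padding=15)
        self.dec2 = nn.ConvTranspose1d(channels * 2, 1, kernel_size=32, stride=2, padding=15)
        self.act = nn.PReLU()

    def forward(self, noisy):                      # noisy: (batch, 1, time)
        e1 = self.act(self.enc1(noisy))            # (batch, C, time/2)
        e2 = self.act(self.enc2(e1))               # (batch, 2C, time/4)
        z = e2 * self.attn(e2)                     # attention-weighted bottleneck
        d1 = self.act(self.dec1(z))                # (batch, C, time/2)
        d1 = torch.cat([d1, e1], dim=1)            # skip connection to the encoder
        return torch.tanh(self.dec2(d1))           # (batch, 1, time)


if __name__ == "__main__":
    x = torch.randn(4, 1, 16384)
    print(Generator()(x).shape)                    # torch.Size([4, 1, 16384])
```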
(4) The speech enhancement model can obtain better results by increasing the size of the network, but a huge model harms real-time performance in the cloud video conference. This thesis designs a pruning and quantization procedure that is applied after training of the speech enhancement model is completed. The experiments compare the impact of fixed-point quantization and unstructured pruning on the speech enhancement model and determine which acceleration method suits it better (a sketch is given after item (5) below).

(5) This thesis designs and implements the architecture and functions of the voice enhancement system based on the HUAWEI CLOUD video conferencing system.
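For contribution (4), the following is a hedged sketch of post-training compression using PyTorch's built-in pruning and quantization utilities. The placeholder model and the 30% pruning ratio are assumptions, and dynamic int8 quantization is used here only as a readily available stand-in for the fixed-point quantization scheme compared in the thesis.

```python
# Minimal sketch of post-training compression: unstructured magnitude pruning
# followed by int8 dynamic quantization. The two-layer model, the 30% pruning
# ratio, and the 257-dimensional frames are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(257, 512), nn.ReLU(), nn.Linear(512, 257))

# 1) unstructured pruning: zero out the 30% smallest-magnitude weights per layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")              # make the pruning permanent

# 2) dynamic quantization: store Linear weights as int8 and compute in int8 at run time
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    frame = torch.randn(1, 257)                     # e.g. one noisy spectral frame
    print(quantized(frame).shape)                   # torch.Size([1, 257])
```

Measuring latency and PESQ/STOI before and after each step is how one would decide, as the thesis does, which of the two acceleration methods fits the speech enhancement model better.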