In the last ten years, with the rapid development of neural network technology, artificial intelligence applications represented by speech interaction have been widely adopted across many industries and products, especially mobile phone input methods, voice assistants, automobiles, televisions, and smart speakers. Speech technology is also becoming increasingly common in hardware products such as white goods, robots of various forms, and subway ticket machines. Mobile phone speech input has now reached a high overall accuracy and is well accepted by an increasingly broad range of users. However, in the era of the AIoT, the technical difficulty of speech control for smart hardware differs significantly from that of mobile phone speech input. When speech input is used on a mobile phone, the distance between the speaker and the device is very short, so the signal-to-noise ratio of the captured speech signal is high. Speech recognition for intelligent hardware is much more difficult, for two main reasons. First, devices such as automobiles, robots, and home appliances operate amid various complex background noises. Second, in intelligent hardware scenarios the distance between the user and the device is often large, so the target speech signal captured by the device is attenuated; at the same time, the reverberation caused by long-distance speech propagation seriously degrades the clarity and intelligibility of the signal. Together, these factors severely affect the speech interaction experience.

Research on interactive speech recognition is usually divided into a front-end system and a back-end system. The front-end system performs various kinds of preprocessing on the speech signal captured by the device, removing environmental noise, interfering speakers, reverberation, and other degradations to extract a clean target speech signal. The back-end system converts the input speech signal into text through acoustic modeling, language modeling, and decoding. This paper studies several speech preprocessing technologies in the front-end system for interactive speech recognition, with the goal of improving the quality of the target speech signal through better preprocessing and thereby improving the speech interaction experience.

First, to address the unsatisfactory noise reduction of traditional single-channel speech enhancement, we propose a new speech enhancement matrix regression model based on a fully convolutional neural network (fully CNN), which directly realizes a 2D-to-2D mapping from the noisy log spectrum at the input to a time-frequency mask at the output. Through the deep integration of small-size convolution filters and multi-objective learning, both speech intelligibility and speech recognition accuracy are significantly improved.
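To illustrate the 2D-to-2D mapping described above, the following is a minimal PyTorch sketch of a fully convolutional mask estimator built from small 3x3 filters and trained with a simple multi-objective loss; the layer count, channel width, and choice of auxiliary target are illustrative assumptions, not the exact architecture proposed in this work.

```python
# Minimal sketch: fully convolutional 2D-to-2D mapping from a noisy
# log-magnitude spectrogram (frequency x time) to a time-frequency mask
# of the same shape, with an auxiliary clean-spectrum target for
# multi-objective learning. Sizes are hypothetical.
import torch
import torch.nn as nn

class FCNMaskEstimator(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # Small 3x3 filters with padding=1 preserve the 2D input size end to end.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.mask_head = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
        # Auxiliary head: predict the clean log spectrum alongside the mask.
        self.spec_head = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, noisy_log_spec):            # (batch, 1, freq, time)
        h = self.backbone(noisy_log_spec)
        return self.mask_head(h), self.spec_head(h)

def multi_objective_loss(mask_pred, spec_pred, mask_target, clean_log_spec, alpha=0.5):
    # Weighted sum of the mask loss and the clean log-spectrum loss.
    return alpha * nn.functional.mse_loss(mask_pred, mask_target) + \
           (1 - alpha) * nn.functional.mse_loss(spec_pred, clean_log_spec)
```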
Second, traditional beamforming-based speech enhancement algorithms have difficulty estimating spatial statistics accurately in multi-speaker scenarios and leave considerable residual noise, while deep learning, despite its strong nonlinear modeling capability, depends heavily on the scale and quality of training data. To address this, we propose a spatial and speaker aware iterative mask estimation (SSA-IME) method for multi-microphone deep learning speech enhancement. This algorithm achieved the lowest word error rate in the Track 1 task of the CHiME-6 Challenge.

For speech dereverberation, we propose estimating the prediction filter with a neural network model to overcome the slow convergence of linear prediction algorithms, and we design a causal model structure with a small number of parameters to meet the real-time interaction requirements of embedded platforms, significantly improving dereverberation on the first utterance of a speech interaction. For multi-microphone scenarios, we design a comprehensive dereverberation system that combines the advantages of several dereverberation algorithms and achieves better dereverberation performance and stability.

Finally, to address the speech problems caused by complex noise in automotive scenarios, we propose a speech enhancement method using two microphones. By studying and analyzing the types of noise in the car driving environment, we design a speech presence detection method based on a deep neural network and apply it in the subsequent stages of relative transfer function estimation, beamforming, and post-filtering. Experiments show that this enhancement method achieves good results in practical automotive applications, and the speech enhancement module equipped with this solution has also achieved good commercialization results. Minimal sketches of the mask-driven beamforming and linear-prediction dereverberation ideas discussed above are given below.
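To make the mask-driven beamforming idea concrete, the sketch below shows how a time-frequency mask (or speech presence probability) from a neural estimator can weight the multi-channel spatial statistics that drive an MVDR beamformer, the common backbone behind both mask-based multi-microphone front-ends and two-microphone beamforming with post-filtering. It is a generic illustration, not the exact SSA-IME or automotive formulation; the closed form used, the reference-channel choice, and the variable names are assumptions.

```python
# Minimal NumPy sketch: mask-weighted spatial covariance estimation
# followed by a per-frequency MVDR beamformer.
import numpy as np

def mask_driven_mvdr(stft, speech_mask, ref_ch=0, eps=1e-6):
    """stft: (channels, freq, time) complex STFT; speech_mask: (freq, time) in [0, 1]."""
    C, F, T = stft.shape
    enhanced = np.zeros((F, T), dtype=np.complex128)
    for f in range(F):
        X = stft[:, f, :]                          # (channels, time)
        w_s = speech_mask[f]                       # speech presence weights
        w_n = 1.0 - speech_mask[f]                 # noise presence weights
        # Mask-weighted spatial covariance matrices of speech and noise.
        phi_s = (w_s * X) @ X.conj().T / max(w_s.sum(), eps)
        phi_n = (w_n * X) @ X.conj().T / max(w_n.sum(), eps)
        phi_n += eps * np.eye(C)                   # diagonal loading for stability
        # MVDR weights via the common closed form based on phi_n^{-1} phi_s.
        num = np.linalg.solve(phi_n, phi_s)
        w = num[:, ref_ch] / max(np.trace(num).real, eps)
        enhanced[f] = w.conj() @ X                 # beamformer output y = w^H x
    return enhanced
```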
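For the dereverberation part, this work estimates the prediction filter directly with a neural network; as a simpler illustration of the underlying linear-prediction framework, the sketch below solves the WPE-style weighted least-squares problem in closed form per frequency, with the target power spectral density assumed to come from a hypothetical neural estimator in place of slow iterative re-estimation. This is a related neural-WPE-flavored variant shown for illustration, not the method proposed here.

```python
# Minimal NumPy sketch: single-channel linear-prediction (WPE-style)
# dereverberation where the clean-speech PSD `psd` is supplied externally,
# e.g. by a small causal neural network (hypothetical).
import numpy as np

def neural_wpe_single_channel(stft, psd, taps=10, delay=3, eps=1e-10):
    """stft: (freq, time) complex; psd: (freq, time) estimated clean-speech PSD."""
    F, T = stft.shape
    out = stft.copy()
    for f in range(F):
        x = stft[f]
        lam = np.maximum(psd[f], eps)
        # Delayed tap matrix: Xd[k, t] = x[t - delay - k].
        Xd = np.zeros((taps, T), dtype=np.complex128)
        for k in range(taps):
            shift = delay + k
            if shift < T:
                Xd[k, shift:] = x[:T - shift]
        # PSD-weighted normal equations for the prediction filter g.
        R = (Xd / lam) @ Xd.conj().T
        r = (Xd / lam) @ x.conj()
        g = np.linalg.solve(R + eps * np.eye(taps), r)
        # Subtract the predicted late reverberation from the observation.
        out[f] = x - g.conj() @ Xd
    return out
```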