Research On Speech Enhancement Based On Improved Transformer

Posted on: 2024-04-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
GTID: 2568306935483794
Subject: Electronic information
Abstract/Summary:
Noise readily interferes with speech signals during transmission. Speech enhancement is the technology that removes noise from noisy speech and restores the original clean speech as faithfully as possible. Traditional speech enhancement techniques show many drawbacks in complex noise environments and under conditions that violate their assumptions: the algorithms must make assumptions during noise reduction, and they tend to produce speech distortion at low signal-to-noise ratios. In addition, the growing demand for speech quality places higher requirements on speech enhancement technology. Researchers in this field have applied deep learning to speech enhancement to solve problems that traditional techniques struggle with, namely restoring clean speech in complex environments without preset conditions, and have achieved considerable results.

In recent years, with the development of the Internet of Things and 5G technology, people can hold video calls, live broadcasts and conferences with anyone around the world, anytime and anywhere, and their requirements for voice quality keep rising. Speech that is noisy and hard to understand not only fails to deliver information effectively but also seriously reduces user satisfaction. However, existing speech enhancement technology still has drawbacks in complex noise environments and under conditions that violate its assumptions:
1. Traditional speech enhancement algorithms have difficulty capturing contextual speech information, so it is hard for them to infer the complete clean speech from contextual relationships in a noisy environment.
2. Existing speech enhancement methods based on deep learning improve performance, but at the cost of a large increase in the number of parameters and the amount of computation, which seriously hinders their practical application.

Therefore, this thesis combines a Transformer with a fully convolutional network (FCN) to build a new FCN+Transformer speech denoising algorithm, and validates and tests it on the SPIB datasets. The research consists of the following two parts.

First, because a convolutional neural network has a limited receptive field, it has difficulty capturing contextual speech information, which limits speech recovery performance. This thesis studies combining a Transformer module with a fully convolutional network to further improve speech quality and intelligibility by capturing global context. A single-channel speech denoising algorithm that fuses FCN and Transformer is proposed. Specifically, a symmetric fully convolutional network serves as the backbone and a Transformer module is added, so that the network can both extract local features and model context. In the experiments, SNR, PESQ and STOI are used as evaluation metrics to compare different numbers of convolutional layers and different numbers of attention heads in order to find the optimal parameters. The experimental results show that the improved algorithm clearly improves speech quality and speech intelligibility.

Second, although the Transformer module improves performance, it also increases the number of parameters. To solve this problem, this thesis proposes a speech denoising network based on an improved Transformer. The fully convolutional network is still used as the basic framework, and the improved Transformer module replaces part of the convolutional layers and the original Transformer module. In the improved Transformer module, the computationally expensive linear transformations are replaced by convolution operations to reduce the computation cost. This thesis also introduces relative position encoding into multi-head attention to improve the model's awareness of distance. Experiments show that the proposed algorithm is not only lighter but also improves speech quality.
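
As an illustration of how relative position information can enter attention, the following is a minimal single-head sketch in plain Python. It is not the thesis's implementation: the abstract does not state which formulation is used, so the additive bias per clipped offset `j - i`, the function name `relative_attention`, and the parameters `rel_bias` and `max_dist` are all assumptions made for this example. In a trained network the bias values would be learned and there would be one such table per head.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention(q, k, v, rel_bias, max_dist):
    """Scaled dot-product attention with an additive relative-position bias.

    q, k, v  : lists of T vectors (lists of floats); q and k share dimension d.
    rel_bias : hypothetical table of 2 * max_dist + 1 scalars;
               rel_bias[c + max_dist] is added to the score of every
               (query i, key j) pair whose clipped offset j - i equals c.
    Returns (outputs, attention_weights).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            # Clip the offset so distant frames share one bias entry.
            offset = max(-max_dist, min(max_dist, j - i))
            dot = sum(a * b for a, b in zip(qi, kj))
            scores.append(dot * scale + rel_bias[offset + max_dist])
        weights.append(softmax(scores))
    # Each output frame is the attention-weighted sum of the value vectors.
    outputs = [
        [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        for w in weights
    ]
    return outputs, weights
```

Because the bias depends only on the offset between frames, the same table applies at every position in the utterance, which is one way a model can be made aware of how far apart two frames are.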
Keywords/Search Tags: Deep learning, Transformer, Fully Convolutional Neural Network, Speech Enhancement