
Deep Neural Network-Based Acoustic Signal Synthesis and Separation Research

Posted on: 2024-04-08    Degree: Master    Type: Thesis
Country: China    Candidate: P Lin    Full Text: PDF
GTID: 2568307142981999    Subject: Computer Science and Technology

Abstract/Summary:
Acoustic signal processing has long been one of the core research directions in signal and information processing and is the basis of speech communication. It generally includes speech recognition, speech separation, speech emotion recognition, speaker recognition, speech synthesis, and related tasks. The purpose of speech separation is to extract the target speech signal of interest from a mixed speech signal. Current speech separation systems achieve this by modifying the mixed signal, but they suffer from two problems: insufficient suppression and excessive suppression of speech. Both problems distort the separated speech and impair signal quality. Speech synthesis systems generate high-quality speech from text alone, and producing realistic acoustic representations without a reference audio signal is a difficult task. To address these problems, this paper proposes corresponding acoustic signal processing algorithms.

Frequency-domain approaches to single-channel speech separation suffer from phase distortion, which limits separation performance. Time-domain approaches use an encoder-decoder framework to model the mixed waveform directly, avoiding the phase prediction problem; however, because they operate directly on the waveform, they make poor use of the acoustic information contained in the time-frequency representation. This paper combines time-domain and frequency-domain features to exploit the advantages of both domains: the features from the two domains are concatenated into a time-frequency feature map, on which joint cross-domain embedding and clustering are performed, enabling the model to learn the behaviour of the signal in each domain and the correlations across them. Two cross-domain feature selection modules are proposed in the encoder: trainable weights and a global cross-domain feature selection module (a minimal illustrative sketch follows the abstract). Trainable weights let the network learn the relative importance of features from different domains directly, while the use of global information improves the generalisability of the network; both approaches improve the performance of the cross-domain separation model. Experiments on a public benchmark dataset show that the proposed separation model improves significantly over previous work, reaching SDR = 16.6 dB and SI-SNR = 16.9 dB. In addition, the deep dilated convolution implemented in this paper reduces the parameter count by nearly one third.

The most difficult part of the speech synthesis task is predicting realistic prosody (timing, pitch and loudness contours) from plain text. This paper combines separation and synthesis: the acoustic parameters of clean speech are predicted from the noisy signal, and a vocoder synthesises the speech waveform. Because the speech is resynthesised from the predicted parameters, the output quality is higher than that of a standard speech separation or denoising system. A convolution-free waveform generator based on the self-attention module is proposed. To reduce the computational complexity of self-attention, attention is computed locally within a sliding window, which allows the network to be trained and to synthesise speech efficiently. Experimental results show that the improved neural vocoder in the proposed parametric synthesis model is superior in objective metric scores, model size and synthesis speed.
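The cross-domain fusion with trainable feature-selection weights described above can be illustrated with a minimal sketch. This is not the thesis implementation: the learned 1-D convolutional encoder, the STFT settings, the channel sizes and the softmax-normalised domain weights are illustrative assumptions about how such a module could be built in PyTorch.

```python
import torch
import torch.nn as nn


class CrossDomainFusion(nn.Module):
    """Fuse time-domain and frequency-domain encoder features with
    trainable domain-importance weights (illustrative, not the thesis code)."""

    def __init__(self, n_filters=256, kernel_size=16, stride=8, n_fft=512):
        super().__init__()
        self.n_fft = n_fft
        self.stride = stride
        # Learned waveform encoder (Conv-TasNet-style basis).
        self.time_enc = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Project STFT magnitudes to the same channel dimension.
        self.freq_proj = nn.Conv1d(n_fft // 2 + 1, n_filters, kernel_size=1, bias=False)
        # Trainable weights: the network learns how much each domain matters.
        self.domain_weights = nn.Parameter(torch.ones(2))

    def forward(self, mixture):                                    # (batch, samples)
        t_feat = torch.relu(self.time_enc(mixture.unsqueeze(1)))   # (B, C, T_time)
        spec = torch.stft(mixture, n_fft=self.n_fft, hop_length=self.stride,
                          window=torch.hann_window(self.n_fft, device=mixture.device),
                          return_complex=True)
        f_feat = self.freq_proj(spec.abs())                        # (B, C, T_freq)
        frames = min(t_feat.shape[-1], f_feat.shape[-1])           # align frame counts
        w = torch.softmax(self.domain_weights, dim=0)
        # Weighted concatenation forms the time-frequency feature map.
        return torch.cat([w[0] * t_feat[..., :frames],
                          w[1] * f_feat[..., :frames]], dim=1)     # (B, 2C, frames)


fusion = CrossDomainFusion()
feature_map = fusion(torch.randn(2, 16000))    # two 1-second mixtures at 16 kHz
print(feature_map.shape)                       # torch.Size([2, 512, 1999])
```

The concatenated, weighted features stand in for the time-frequency feature map on which joint cross-domain embedding and clustering would then operate; a global feature selection variant would derive the weights from pooled statistics of both feature streams rather than from free parameters.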
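The sliding-window local attention used by the convolution-free vocoder can likewise be sketched. The masked nn.MultiheadAttention below only illustrates the attention pattern; the dimension, head count and window size are assumptions, and an efficient implementation would restrict the computation itself (for example banded or chunked attention) rather than masking a full attention matrix.

```python
import torch
import torch.nn as nn


class LocalSelfAttention(nn.Module):
    """Self-attention restricted to a sliding window of neighbouring frames
    (illustrative pattern only; hyper-parameters are assumptions)."""

    def __init__(self, dim=128, n_heads=4, window=32):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):                              # x: (batch, frames, dim)
        idx = torch.arange(x.shape[1], device=x.device)
        # Boolean mask: True = "too far away, do not attend".
        too_far = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(x, x, x, attn_mask=too_far)
        return out


layer = LocalSelfAttention()
out = layer(torch.randn(2, 400, 128))   # 400 acoustic frames per utterance
print(out.shape)                        # torch.Size([2, 400, 128])
```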
Keywords/Search Tags:Speech separation, Speech synthesis, Neural vocoder, Feature selection module, Self-attention