
Research On Key Technologies Of Underwater Low Bit Rate Speech Coding Based On Deep Learning

Posted on: 2024-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: L H Zhang
Full Text: PDF
GTID: 2568306944956379
Subject: Underwater Acoustics

Abstract/Summary:
In underwater communication, bandwidth constraints often necessitate low-rate speech coding schemes when communicating with divers. These schemes transmit and synthesize speech features to reduce the transmission rate. In the complex, noisy underwater environment, however, traditional low-rate speech coding faces several problems. First, owing to the characteristics of human speech, signals contain many non-speech frames that carry no information, and transmitting them reduces transmission efficiency. Second, transmitting unprocessed noise degrades the synthesis quality of the decoded speech, hurting both quality and intelligibility. Third, low-rate coding itself often yields low-quality, less intelligible synthesized speech. To address these issues, this paper proposes an underwater communication system based on Mixed Excitation Linear Prediction (MELP) that uses neural networks to build a classification module, a noise-reduction module, and a parameter-optimization module, with the aim of improving the system's transmission rate and the speech quality at the synthesis end. The main research content is as follows.

1. In the speech preprocessing stage, before speech enters the MELP system, a neural-network-based classification module and noise-reduction module are introduced. The classification module separates the input into speech and non-speech segments. Speech segments are enhanced by the noise-reduction module before being passed to the MELP system for feature extraction and transmission; for non-speech segments, the system merely counts them and transmits the count. This improves the transmission rate of semantic information and the speech quality of the MELP system. The classification model is built from Recurrent Neural Networks (RNN) and Stacked Auto-Encoders (SAE), and input speech is segmented into speech and non-speech frames. A test set of 100 randomly selected speech segments, covering male and female voices of different ages, is augmented with underwater bubble noise at signal-to-noise ratios (SNRs) of -10 dB, -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB; simulated diver-inhalation sounds, enhanced through data augmentation, are also added to each segment. Against underwater non-stationary noise, the proposed classification model outperforms the traditional dual-threshold method across the full -10 dB to 20 dB SNR range. Furthermore, RNNoise is introduced to denoise the input speech. RNNoise combines deep learning with traditional digital signal processing: a neural network replaces the gain computation of traditional spectral subtraction, adjusting the denoising weights in different frequency bands. Simulation experiments show that the trained denoising model outperforms traditional methods such as spectral subtraction and Wiener filtering in both speech quality and SNR, for stationary noise (e.g., Gaussian white noise) and non-stationary noise (e.g., underwater bubble noise) alike.

2. The MELP low-rate speech coding scheme is selected as the transmission scheme and implemented. To improve MELP's output speech quality, an optimization scheme is designed to address the quantization errors introduced by vector quantization: the codebook is fitted with a neural
network, a model is established, and the model is used to correct the parameters received at the decoding stage, improving the synthesis quality of the MELP vocoder. MELP's encoding and decoding principles are analyzed and implemented through code deployment and simulation. When the MELP system transmits speech, features are extracted and quantized at the encoder and the quantized features are transmitted; at the decoder, they are decoded and dequantized, and speech is synthesized from them. Feature accuracy directly affects the fidelity and quality of the synthesized speech: speech synthesized from accurately transmitted parameters is of higher quality than speech synthesized from corrupted ones. This paper therefore uses a multi-head attention mechanism to build a model that compensates for quantization errors in the transmitted speech feature parameters. Experiments show that this module improves the Mean Opinion Score (MOS) of the output speech by approximately 0.1 points on a scale of -0.5 to 4.5.

3. The above models and the MELP system are deployed and optimized, and subjective and objective speech evaluation standards are introduced to evaluate the output. The modules are optimized for real-world scenarios and encapsulated; by calling dynamic libraries from Python, the software design and implementation of the optimized low-rate speech communication system are completed. In actual speech processing, the system denoises and transmits speech frames and packs non-speech frames as counts for transmission. PESQ and DRT are introduced as evaluation standards for speech quality and intelligibility, and the outputs of each module and of the system
are evaluated. On a real speech segment, the speech length the system needed to transmit decreased by 51.62%, while the PESQ score rose from 1.984 to 2.442 and the DRT score from 85.6 to 92.1.
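As a reference point for part 1, the traditional dual-threshold baseline that the classification model is compared against can be sketched in a few lines: a frame is kept as speech if its short-time energy exceeds one threshold or its zero-crossing rate exceeds another. The thresholds, frame length, and function name below are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

def dual_threshold_vad(signal, frame_len=160, energy_thresh=0.01, zcr_thresh=0.25):
    """Classify each frame as speech (True) or non-speech (False) using the
    classical dual-threshold rule: short-time energy plus zero-crossing rate."""
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                              # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2        # zero-crossing rate
        # voiced speech: high energy; unvoiced speech: low energy but high ZCR
        flags.append(bool(energy > energy_thresh or zcr > zcr_thresh))
    return flags

# toy check: one sine-tone "speech" frame followed by one silent frame
t = np.arange(160)
signal = np.concatenate([0.5 * np.sin(2 * np.pi * t / 16), np.zeros(160)])
print(dual_threshold_vad(signal))  # [True, False]
```

Such fixed thresholds are exactly what breaks down under the non-stationary bubble and breathing noise described above, which is what motivates replacing them with the learned RNN/SAE classifier.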
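The quantization-error compensation in part 2 could, under one plausible reading, treat the dequantized parameter frames as a sequence and predict a residual correction with multi-head attention. The sketch below is a minimal NumPy illustration with random, untrained weights; the function names, the residual formulation, and the dimensions are assumptions for illustration, not the thesis's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq, d_model). Project to Q/K/V, split into heads, attend, merge."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    def split(t):  # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    out = softmax(scores) @ v                             # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq, d_model)    # merge heads
    return out @ Wo

def compensate(dequantized_params, weights, n_heads=4):
    """Hypothetical residual correction: refine each dequantized MELP
    parameter frame using attention over neighbouring frames."""
    Wq, Wk, Wv, Wo = weights
    return dequantized_params + multi_head_attention(
        dequantized_params, Wq, Wk, Wv, Wo, n_heads)

# shape check with random weights (a trained model would supply these)
rng = np.random.default_rng(0)
d_model, seq = 8, 5
weights = tuple(0.1 * rng.standard_normal((d_model, d_model)) for _ in range(4))
refined = compensate(rng.standard_normal((seq, d_model)), weights)
print(refined.shape)  # (5, 8)
```

The residual form (output = input + correction) is a common design choice for this kind of error compensation, since the dequantized parameters are already close to the true values and the network only needs to learn the small quantization error.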
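The "transmit non-speech frames as counts" packing described in parts 1 and 3 amounts to run-length coding the classifier's frame decisions: speech frames go on to MELP coding, while each run of non-speech frames collapses to a single count. A minimal sketch (the function name and tuple format are illustrative assumptions):

```python
def pack_frames(flags, frames):
    """Run-length pack a frame stream: speech frames are kept for MELP
    coding and transmission; each run of non-speech frames is replaced
    by its frame count, so the receiver can reinsert silence."""
    packed, silence_run = [], 0
    for is_speech, frame in zip(flags, frames):
        if is_speech:
            if silence_run:                      # close out a silence run
                packed.append(('silence', silence_run))
                silence_run = 0
            packed.append(('speech', frame))
        else:
            silence_run += 1
    if silence_run:                              # trailing silence run
        packed.append(('silence', silence_run))
    return packed

flags = [True, False, False, True]
frames = ['f0', 'f1', 'f2', 'f3']
print(pack_frames(flags, frames))
# [('speech', 'f0'), ('silence', 2), ('speech', 'f3')]
```

Since a silence count costs far fewer bits than a coded frame, the savings grow with the fraction of non-speech frames, which is consistent with the 51.62% reduction in transmitted speech length reported above.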
Keywords/Search Tags: low bit rate speech coding, deep learning, speech preprocessing, vector quantization