Audio super-resolution is an audio processing technique that reconstructs high-resolution audio signals from low-resolution ones. In the time domain, it increases the sampling rate of the audio signal by learning a mapping from the low-resolution waveform to the high-resolution waveform; in the frequency domain, it extends the bandwidth of narrowband audio by reconstructing the missing high-frequency components of the signal. Accurate and efficient audio super-resolution algorithms have a wide range of application scenarios: beyond traditional audio communication, they can be integrated with other signal processing techniques to improve their reliability and accuracy. Audio super-resolution therefore has significant research and practical value. This thesis studies audio super-resolution based on deep learning methods. The main contributions are as follows:

First, we propose an audio super-resolution algorithm based on a Taper residual dense Network (TNet). To address the difficulty that current UNet-based audio super-resolution algorithms have in fully exploiting the internal features at each level, we design a feature extraction and fusion branch built on residual dense blocks; this branch is well suited to one-dimensional audio signals and is easy to extend. We then design a taper network architecture that combines the advantages of time-domain and frequency-domain modeling: multiple lower branches perform super-resolution modeling and feature extraction on different aspects of the audio signal, and the upper branch fuses these features and produces the output. Experimental results demonstrate that the proposed algorithm achieves better audio super-resolution reconstruction on both frequency-domain and time-domain metrics with fewer parameters.

Second, we further improve TNet with a self-attention mechanism. To overcome the limitation that the receptive field of a purely convolutional network is bounded by the convolution kernel size, we integrate an Attention-based Feature-Wise Linear Modulation (AFiLM) module into the taper residual dense network. It allows the model to mine the contextual information of the audio waveform at different positions in parallel, thereby capturing potential long-term dependencies and further enhancing the model's representational ability. To address the tendency of the frequency-domain phase branch toward gradient explosion during training, we improve the time-frequency fusion method of the taper residual dense network. Moreover, since training the model with a mean squared error loss alone easily leads to overfitting in the time domain, we adopt a time-frequency loss that encourages the model to reconstruct more frequency-domain detail while fitting the time-domain waveform of the audio signal. Experiments show that AFiLM-TNet substantially improves the frequency-domain performance of the reconstructed audio while also improving the time-domain metric, and generates more realistic log-power spectra.
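
The residual dense block underlying the first contribution can be sketched as follows. This is a minimal numpy illustration of dense connectivity with a residual connection for one-dimensional multi-channel signals; the channel count, growth rate, kernel size, and weights here are illustrative assumptions, not the thesis's actual TNet configuration.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1-D convolution. x: (C_in, T), w: (C_out, C_in, K), K odd.
    np.convolve flips its kernel, so we reverse it to get the cross-correlation
    convention used by deep learning frameworks."""
    c_out, c_in, k = w.shape
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    out = np.zeros((c_out, x.shape[1]))
    for o in range(c_out):
        for i in range(c_in):
            out[o] += np.convolve(xp[i], w[o, i][::-1], mode="valid")
    return out

def residual_dense_block(x, layer_weights, fusion_w):
    """Dense connectivity: each conv layer sees the concatenation of the input
    and all earlier layers' outputs; a 1x1 fusion conv maps back to the input
    channel count, and a residual connection adds the input."""
    features = [x]
    for w in layer_weights:
        inp = np.concatenate(features, axis=0)
        features.append(np.maximum(conv1d(inp, w), 0.0))  # ReLU activation
    fused = conv1d(np.concatenate(features, axis=0), fusion_w)
    return x + fused

# Tiny example: 4 channels, growth rate 2, 3 dense layers, kernel size 3.
rng = np.random.default_rng(1)
C, g, T, K, L = 4, 2, 32, 3, 3
x = rng.standard_normal((C, T))
layer_ws = [0.1 * rng.standard_normal((g, C + l * g, K)) for l in range(L)]
fusion_w = 0.1 * rng.standard_normal((C, C + L * g, 1))
y = residual_dense_block(x, layer_ws, fusion_w)
assert y.shape == x.shape  # the block preserves channel count and length
```

The shape-preserving output is what makes such a block easy to stack and to extend with additional dense layers, matching the extensibility claim above.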
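
The time-frequency loss used in the second contribution combines a time-domain term with a frequency-domain term. The sketch below is a hypothetical numpy version under assumed settings (Hann window, FFT size, hop length, and balancing weight `alpha` are all illustrative, not values from the thesis): time-domain MSE on the waveform plus MSE on log-magnitude spectra.

```python
import numpy as np

def stft_log_magnitude(x, n_fft=256, hop=64):
    """Log-magnitude STFT via a sliding Hann window (illustrative parameters)."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
    return np.log(spec + 1e-7)  # small epsilon avoids log(0)

def time_frequency_loss(pred, target, alpha=0.5):
    """Weighted sum of time-domain MSE and frequency-domain log-magnitude MSE.
    alpha is an assumed balancing weight, not the thesis's value."""
    t_loss = np.mean((pred - target) ** 2)
    f_loss = np.mean((stft_log_magnitude(pred) - stft_log_magnitude(target)) ** 2)
    return alpha * t_loss + (1.0 - alpha) * f_loss

# A perfect reconstruction has zero loss; a noisy one incurs a positive loss.
rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1024))
pred = target + 0.05 * rng.standard_normal(1024)
assert time_frequency_loss(target, target) == 0.0
assert time_frequency_loss(pred, target) > 0.0
```

Because the spectral term penalizes missing high-frequency energy that a plain waveform MSE barely notices, it pushes the model to reconstruct frequency-domain detail rather than only fitting the time-domain waveform.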