
Automatic Audio Captioning Based On Reinforcement Learning

Posted on: 2023-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: G Y Chen
Full Text: PDF
GTID: 2568306836972449
Subject: Electronic and communication engineering
Abstract/Summary:
Automated audio captioning is a cross-modal text generation task that aims to describe the content of input audio data in natural language. Compared with traditional tasks such as audio classification, automated audio captioning is more complex, but it also has broader application prospects, such as providing assistive services for people with disabilities. Existing work mainly investigates new methods and tries to improve their performance as measured on existing datasets. Because cross-modal pre-trained resources are lacking, few studies try to improve the system with pre-trained models. The field currently faces several problems, including the small number of available datasets and the poor quality of the captions generated by the decoder. To address these problems, this paper presents an audio captioning system with an encoder-decoder architecture, in which the decoder predicts words based on audio features extracted by the encoder. We then use different single-modality pre-trained resources to improve the system. The specific work is as follows:

(1) To improve the performance of the multi-modal system with single-modality pre-trained resources, this paper takes advantage of two audio-modality resources. First, we introduce PANNs, a pre-trained model from the audio classification task, into our audio captioning system and use it to initialize the parameters of our encoder. Then, we use the AudioCaps dataset to pre-train the overall system, so that the encoder extracts audio features that are better suited for the decoder to generate captions. Experimental results from training on the Clotho dataset show that the audio-modality pre-trained resources we use are effective and significantly improve the performance of the audio captioning system.

(2) In addition to the audio-modality pre-trained resources, this paper also explores applying a text-modality pre-training method to the audio captioning system. We use a training method derived from reinforcement learning that targets text evaluation metrics to improve the performance of the system, so that the model generates better captions. Finally, we combine the audio-modality and text-modality pre-training methods. Experimental results from training on the Clotho dataset show that the combined methods greatly improve the final performance of the cross-modal automated audio captioning system.
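
The reinforcement learning step described in (2) can be illustrated with a minimal sketch. The abstract does not name the algorithm or the metric, so everything here is an assumption: the loss is a generic REINFORCE-with-baseline (policy-gradient) form, the reward is a toy unigram-overlap F1 standing in for a real captioning metric, and the names unigram_f1 and policy_gradient_loss are hypothetical.

```python
import math
from collections import Counter


def unigram_f1(candidate, reference):
    """Toy reward (assumption): unigram-overlap F1 between two token lists.

    A real system would use a captioning metric instead; this stand-in only
    keeps the example self-contained.
    """
    if not candidate or not reference:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def policy_gradient_loss(sampled_log_probs, sampled_caption,
                         baseline_caption, reference):
    """REINFORCE-with-baseline loss for one sampled caption (hypothetical helper).

    sampled_log_probs: log-probability the decoder assigned to each sampled word.
    baseline_caption:  a second caption (for example a greedy decode) whose
                       reward serves as the baseline.
    """
    advantage = (unigram_f1(sampled_caption, reference)
                 - unigram_f1(baseline_caption, reference))
    # Minimizing this loss raises the log-probability of captions that beat
    # the baseline and lowers it for captions that score worse.
    return -advantage * sum(sampled_log_probs)


reference = "a dog barks loudly".split()
sampled = "a dog barks".split()
baseline = "a cat meows".split()
log_probs = [math.log(0.5)] * len(sampled)

loss = policy_gradient_loss(log_probs, sampled, baseline, reference)
```

In this example the sampled caption scores higher than the baseline, so the loss is positive and gradient descent on it would push the decoder toward the sampled words. In a real system the log-probabilities would come from the decoder and the loss would be differentiated by a deep learning framework.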
Keywords/Search Tags: Automated audio captioning, Reinforcement learning, Transfer learning, Deep learning, Multi-modal task