
Application Of Multi-Task Based Audio Feature Extraction In Audio Captioning System

Posted on: 2023-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: K Chen
Full Text: PDF
GTID: 2558306914959189
Subject: Electronic Science and Technology
Abstract/Summary:
Automated audio captioning is a cross-modal translation task that describes the content of an audio recording with a complete sentence. For example, given an audio clip, the system generates descriptive text such as "a large group of people talking." Because labelling audio with captions is costly, large datasets for audio captioning are hard to obtain, which makes model training difficult. This thesis proposes the use of auxiliary tasks and transfer learning to extract audio features more effectively.

(1) With limited data, it is challenging to establish the mapping between audio and text. This thesis proposes an audio captioning model built around a keyword prediction task: keyword prediction is designed as an intermediate task that reduces the difficulty of model training. The final model achieves the best performance among systems that use no external data, with a SPIDEr score of 22.7.

(2) To provide the model with sufficient knowledge and address the shortage of training data, this thesis proposes an audio captioning model based on transfer learning and an attention mechanism. The method first performs an audio tagging task on the AudioSet dataset and transfers the knowledge learned there to the audio captioning task. Meanwhile, to obtain efficient audio features, an attention-based feature extraction module makes the captioning model attend to important features and filter out irrelevant information. The proposed method achieves a SPIDEr score of 27.8 on the Clotho dataset, whereas the best previous transfer-learning method achieves 27.0, which shows the effectiveness of the proposed system.

(3) In audio captioning, effective audio features help generate more accurate descriptions. This thesis proposes a multi-task feature-combination audio captioning system. The encoder module extracts frame-level and segment-level audio features, and the keyword prediction module obtains word-level features to guide the decoder's generation. The system achieves a SPIDEr score of 28.3 on the Clotho dataset. To resolve the mismatch between the training objective and the evaluation metrics, reinforcement learning is further introduced to optimize the model directly on the evaluation metric, improving the final SPIDEr score to 30.7, comparable to the current best performance.
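The pipeline shared by the three systems above combines attention-based pooling of frame-level features with a keyword prediction head. The following is a minimal NumPy sketch of that idea only, not the thesis implementation: all names (`attention_pool`, `keyword_head`, `w_score`, `w_kw`) and the dimensions are hypothetical, the attention is reduced to a single learned scoring vector, and random weights stand in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frames, w_score):
    """Weight each frame by a learned relevance score, then average.

    frames:  (T, D) frame-level audio features from the encoder
    w_score: (D,)   scoring vector (hypothetical stand-in for the attention layer)
    returns: (D,) clip-level feature and (T,) attention weights
    """
    scores = frames @ w_score            # one relevance score per frame
    alpha = softmax(scores)              # weights sum to 1 over time
    clip = alpha @ frames                # attention-weighted temporal average
    return clip, alpha

def keyword_head(clip, w_kw):
    """Multi-label keyword probabilities from the pooled clip feature.

    w_kw: (D, K) hypothetical projection to K keyword logits;
    sigmoid (not softmax) because several keywords can be active at once.
    """
    logits = clip @ w_kw
    return 1.0 / (1.0 + np.exp(-logits))

T, D, K = 50, 16, 8                      # frames, feature dim, keyword vocabulary
frames = rng.normal(size=(T, D))
clip, alpha = attention_pool(frames, rng.normal(size=D))
probs = keyword_head(clip, rng.normal(size=(D, K)))

assert np.isclose(alpha.sum(), 1.0)      # attention forms a distribution over frames
assert clip.shape == (D,) and probs.shape == (K,)
assert np.all((probs >= 0.0) & (probs <= 1.0))
```

In the full system, `clip` would feed the caption decoder and `probs` would be trained against reference keywords, letting the attention weights learn to suppress irrelevant frames.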
Keywords/Search Tags:automated audio captioning, cross-modal translation, transfer learning, reinforcement learning