
Application Of Multi-Task Based Audio Feature Extraction In Audio Captioning System

Posted on: 2023-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: K Chen
Full Text: PDF
GTID: 2558306914959189
Subject: Electronic Science and Technology
Abstract/Summary:
Automated audio captioning is a cross-modal translation task that describes the content of an audio recording with a complete sentence. For example, given an audio clip, the system generates descriptive text such as "a large group of people talking." Because labelling audio with captions is costly, large datasets for audio captioning are hard to obtain, which makes model training difficult. This thesis proposes the use of auxiliary tasks and transfer learning to extract audio features more effectively.

(1) With limited data, it is challenging to establish the mapping between audio and text. This thesis proposes an audio captioning model built around a keyword prediction task: keyword prediction is designed as an intermediate task that reduces the difficulty of model training. The final model achieves the best performance among systems that use no external data, with a SPIDEr score of 22.7.

(2) To provide the model with sufficient knowledge and address the shortage of training data, this thesis proposes an audio captioning model based on transfer learning and an attention mechanism. The method first performs an audio tagging task on the AudioSet dataset and transfers the knowledge learned there to the audio captioning task. Meanwhile, to obtain efficient audio features, an attention-based feature extraction module makes the captioning model attend to important features and filter out irrelevant information. The proposed method achieves a SPIDEr score of 27.8 on the Clotho dataset, whereas the best previous transfer-learning method achieves 27.0, which shows the effectiveness of the proposed system.

(3) In audio captioning, effective audio features help generate more accurate descriptions. This thesis proposes a multi-task feature-combination audio captioning system. The encoder module extracts frame-level and segment-level audio features, and the keyword prediction module obtains word-level features to guide the decoder's generation. The system achieves a SPIDEr score of 28.3 on the Clotho dataset. To resolve the mismatch between the training objective and the evaluation metrics, reinforcement learning is further introduced to optimize the model directly on the evaluation metric, improving the final SPIDEr score to 30.7, comparable to the current best performance.
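The pipeline shared by the three systems above combines attention-based pooling of frame-level features with a keyword prediction head. The following is a minimal NumPy sketch of that idea only, not the thesis implementation: all names (`attention_pool`, `keyword_head`, `w_score`, `w_kw`) and the dimensions are hypothetical, the attention is reduced to a single learned scoring vector, and random weights stand in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frames, w_score):
    """Weight each frame by a learned relevance score, then average.

    frames:  (T, D) frame-level audio features from the encoder
    w_score: (D,)   scoring vector (hypothetical stand-in for the attention layer)
    returns: (D,) clip-level feature and (T,) attention weights
    """
    scores = frames @ w_score            # one relevance score per frame
    alpha = softmax(scores)              # weights sum to 1 over time
    clip = alpha @ frames                # attention-weighted temporal average
    return clip, alpha

def keyword_head(clip, w_kw):
    """Multi-label keyword probabilities from the pooled clip feature.

    w_kw: (D, K) hypothetical projection to K keyword logits;
    sigmoid (not softmax) because several keywords can be active at once.
    """
    logits = clip @ w_kw
    return 1.0 / (1.0 + np.exp(-logits))

T, D, K = 50, 16, 8                      # frames, feature dim, keyword vocabulary
frames = rng.normal(size=(T, D))
clip, alpha = attention_pool(frames, rng.normal(size=D))
probs = keyword_head(clip, rng.normal(size=(D, K)))

assert np.isclose(alpha.sum(), 1.0)      # attention forms a distribution over frames
assert clip.shape == (D,) and probs.shape == (K,)
assert np.all((probs >= 0.0) & (probs <= 1.0))
```

In the full system, `clip` would feed the caption decoder and `probs` would be trained against reference keywords, letting the attention weights learn to suppress irrelevant frames.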
Keywords/Search Tags:automated audio captioning, cross-modal translation, transfer learning, reinforcement learning