
Bimodal Speech Separation Based on Dynamic Feature Points

Posted on: 2024-07-22    Degree: Master    Type: Thesis
Country: China    Candidate: Q Lu    Full Text: PDF
GTID: 2568306941992679    Subject: Information and Communication Engineering
Abstract/Summary:
In noisy environments, people can focus on a particular conversation while ignoring background noise, a phenomenon known as the "cocktail party problem". People can also improve their listening ability in acoustically complex environments by watching the speaker's face. With intelligent devices now widespread, enabling machines to have this ability and expanding their application scenarios is an important research direction in speech front-end processing. Speech separation, which extracts the target speaker's voice from a mixed signal containing multiple voices, is a key technology for this problem, and exploiting visual-modality information, when conditions permit, is an important way to further improve separation quality and the service level of intelligent devices. In constructing multimodal models, how to design better model structures and how to organize the exchange of information among multiple targets (speakers) are questions that still need to be explored. In response, this thesis studies a multimodal speech separation model based on deep learning and dynamic lip feature points. The main contents are as follows.

First, single-modal models are studied: speech separation models are constructed with two approaches, time-domain coding separation and time-frequency domain separation, and their performance is compared experimentally. The results show that the time-domain coding model, which uses a trainable encoder and decoder to decompose and reconstruct the speech waveform, achieves better separation performance; it is therefore extended to the multimodal setting as the baseline model (a minimal sketch of this architecture appears after the abstract).

Second, for the multimodal speech separation task, preprocessing of the speaker videos is added, and the extracted dynamic lip feature points are incorporated into the separation task. The structure of the multimodal model is then studied: visual and auditory feature extraction modules are added, with channel concatenation as the feature fusion method, and separation modules based on convolution and on dual-path recurrent networks are constructed (sketches of the fusion step and of a dual-path block follow the abstract). Experiments show that the model separates better when the visual modality is incorporated, and that the dual-path recurrent separation module models long speech sequences more effectively than convolution.

Finally, to address the feature-fusion shortcomings of the existing models, a dual-attention multimodal model is designed that fully accounts for feature interaction among the multiple targets and effectively avoids the "order" problem among multiple visual features (a cross-attention sketch follows the abstract). The model not only uses fewer network parameters but also further improves speech separation performance.
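The abstract describes the time-domain coding baseline only at a high level. Below is a minimal PyTorch sketch of the general idea (a Conv-TasNet-style trainable encoder, mask estimation, and trainable decoder); the layer sizes, the stand-in mask network, and the two-speaker setting are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of time-domain coding separation (Conv-TasNet style).
# Hyperparameters are illustrative, not the thesis's configuration.
import torch
import torch.nn as nn

class TimeDomainSeparator(nn.Module):
    def __init__(self, n_filters=256, kernel_size=16, n_speakers=2):
        super().__init__()
        stride = kernel_size // 2
        # Trainable encoder: a 1-D conv decomposes the waveform into a
        # learned latent representation (in place of a fixed transform).
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
        # Stand-in separation module: estimates one mask per speaker.
        self.mask_net = nn.Sequential(
            nn.Conv1d(n_filters, n_filters * n_speakers, 1),
            nn.Sigmoid(),
        )
        # Trainable decoder: a transposed conv reconstructs each waveform.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride)
        self.n_speakers = n_speakers
        self.n_filters = n_filters

    def forward(self, mix):                     # mix: (batch, 1, time)
        feats = torch.relu(self.encoder(mix))   # (batch, F, frames)
        masks = self.mask_net(feats)            # (batch, F*S, frames)
        masks = masks.view(-1, self.n_speakers, self.n_filters, masks.shape[-1])
        # Apply each speaker's mask, then decode back to the time domain.
        outs = [self.decoder(masks[:, s] * feats) for s in range(self.n_speakers)]
        return torch.stack(outs, dim=1)         # (batch, S, 1, time)

model = TimeDomainSeparator()
est = model(torch.randn(4, 1, 16000))           # 1 s of 16 kHz mixture
print(est.shape)                                # torch.Size([4, 2, 1, 16000])
```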
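Feature fusion by channel concatenation can be illustrated as follows. This is a hedged sketch: the feature dimensions, the nearest-neighbour upsampling that aligns video rate with audio frame rate, and the 1x1 projection are assumptions chosen to make the example self-contained.

```python
# Sketch of audio-visual fusion by channel concatenation.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_feats = torch.randn(4, 256, 1999)   # (batch, channels, audio frames)
visual_feats = torch.randn(4, 64, 50)     # lip feature-point embeddings (video rate)

# Upsample the visual stream so both modalities share one temporal axis.
visual_up = F.interpolate(visual_feats, size=audio_feats.shape[-1], mode='nearest')
fused = torch.cat([audio_feats, visual_up], dim=1)    # (4, 320, 1999)

# A 1x1 conv projects the fused features back to the separator's width
# before they enter the (convolutional or dual-path recurrent) separation module.
proj = nn.Conv1d(320, 256, kernel_size=1)
print(proj(fused).shape)                              # torch.Size([4, 256, 1999])
```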
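The dual-path recurrent separation module handles long sequences by alternating modeling within and across chunks of the feature sequence. A rough single-block sketch follows; the chunking, widths, and residual connections are illustrative assumptions in the spirit of DPRNN, not the thesis's exact module.

```python
# Rough sketch of one dual-path recurrent block: an intra-chunk RNN models
# local structure, an inter-chunk RNN models long-range structure.
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.intra = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, dim)
        self.inter = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, dim)

    def forward(self, x):                # x: (batch, n_chunks, chunk_len, dim)
        b, n, k, d = x.shape
        # Intra-chunk pass: the RNN runs along each chunk independently.
        h = x.reshape(b * n, k, d)
        h = self.intra_proj(self.intra(h)[0]).reshape(b, n, k, d) + x
        # Inter-chunk pass: the RNN runs across chunks at each in-chunk position.
        g = h.transpose(1, 2).reshape(b * k, n, d)
        g = self.inter_proj(self.inter(g)[0]).reshape(b, k, n, d).transpose(1, 2)
        return g + h

block = DualPathBlock()
y = block(torch.randn(2, 40, 50, 64))    # 40 chunks of length 50
print(y.shape)                           # torch.Size([2, 40, 50, 64])
```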
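The abstract does not detail the dual-attention design, but a generic cross-modal attention step conveys how content-based alignment can side-step the "order" problem of position-based fusion: each speaker's audio features attend to lip features by similarity rather than by a fixed stacking order. Everything below (dimensions, a single attention layer) is an assumption for illustration, not the thesis's actual mechanism.

```python
# Hedged sketch of cross-modal attention fusion. Audio queries attend to
# visual keys/values, so audio-visual pairing is learned from content.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
audio = torch.randn(4, 1999, 256)     # (batch, frames, dim) audio features
visual = torch.randn(4, 1999, 256)    # lip features, already time-aligned

# Alignment is computed by similarity, avoiding a fixed visual-stream order.
fused, weights = attn(query=audio, key=visual, value=visual)
print(fused.shape)                    # torch.Size([4, 1999, 256])
```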
Keywords/Search Tags: speech separation, feature points, bimodal, attention