
Bimodal Speech Separation Based on Dynamic Feature Points

Posted on: 2024-07-22    Degree: Master    Type: Thesis
Country: China    Candidate: Q Lu    Full Text: PDF
GTID: 2568306941992679    Subject: Information and Communication Engineering
Abstract/Summary:
In noisy environments, people can focus on a particular conversation while ignoring background noise, a phenomenon known as the "cocktail party problem". People can also improve their listening ability in acoustically complex environments by watching the speaker's face. With intelligent devices now widespread, enabling machines to have this ability and expanding their application scenarios is an important research direction in speech front-end processing. Speech separation, which extracts the target speaker's voice from a mixed signal containing multiple voices, is a key technology for this problem, and exploiting visual-modality information, when conditions permit, is an important way to further improve separation quality and the service level of intelligent devices. In constructing multimodal models, how to design better model structures and how to organize the exchange of information among multiple targets (speakers) are questions that still need to be explored. In response, this thesis studies a multimodal speech separation model based on deep learning and dynamic lip feature points. The main contents are as follows.

First, single-modal models are studied: speech separation models are constructed with two approaches, time-domain coding separation and time-frequency domain separation, and their performance is compared experimentally. The results show that the time-domain coding model, which uses a trainable encoder and decoder to decompose and reconstruct the speech waveform, achieves better separation performance; it is therefore extended to the multimodal setting as the baseline model (a minimal sketch of this architecture appears after the abstract).

Second, for the multimodal speech separation task, preprocessing of the speaker videos is added, and the extracted dynamic lip feature points are incorporated into the separation task. The structure of the multimodal model is then studied: visual and auditory feature extraction modules are added, with channel concatenation as the feature fusion method, and separation modules based on convolution and on dual-path recurrent networks are constructed (sketches of the fusion step and of a dual-path block follow the abstract). Experiments show that the model separates better when the visual modality is incorporated, and that the dual-path recurrent separation module models long speech sequences more effectively than convolution.

Finally, to address the feature-fusion shortcomings of the existing models, a dual-attention multimodal model is designed that fully accounts for feature interaction among the multiple targets and effectively avoids the "order" problem among multiple visual features (a cross-attention sketch follows the abstract). The model not only uses fewer network parameters but also further improves speech separation performance.
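The abstract describes the time-domain coding baseline only at a high level. Below is a minimal PyTorch sketch of the general idea (a Conv-TasNet-style trainable encoder, mask estimation, and trainable decoder); the layer sizes, the stand-in mask network, and the two-speaker setting are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of time-domain coding separation (Conv-TasNet style).
# Hyperparameters are illustrative, not the thesis's configuration.
import torch
import torch.nn as nn

class TimeDomainSeparator(nn.Module):
    def __init__(self, n_filters=256, kernel_size=16, n_speakers=2):
        super().__init__()
        stride = kernel_size // 2
        # Trainable encoder: a 1-D conv decomposes the waveform into a
        # learned latent representation (in place of a fixed transform).
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
        # Stand-in separation module: estimates one mask per speaker.
        self.mask_net = nn.Sequential(
            nn.Conv1d(n_filters, n_filters * n_speakers, 1),
            nn.Sigmoid(),
        )
        # Trainable decoder: a transposed conv reconstructs each waveform.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride)
        self.n_speakers = n_speakers
        self.n_filters = n_filters

    def forward(self, mix):                     # mix: (batch, 1, time)
        feats = torch.relu(self.encoder(mix))   # (batch, F, frames)
        masks = self.mask_net(feats)            # (batch, F*S, frames)
        masks = masks.view(-1, self.n_speakers, self.n_filters, masks.shape[-1])
        # Apply each speaker's mask, then decode back to the time domain.
        outs = [self.decoder(masks[:, s] * feats) for s in range(self.n_speakers)]
        return torch.stack(outs, dim=1)         # (batch, S, 1, time)

model = TimeDomainSeparator()
est = model(torch.randn(4, 1, 16000))           # 1 s of 16 kHz mixture
print(est.shape)                                # torch.Size([4, 2, 1, 16000])
```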
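Feature fusion by channel concatenation can be illustrated as follows. This is a hedged sketch: the feature dimensions, the nearest-neighbour upsampling that aligns video rate with audio frame rate, and the 1x1 projection are assumptions chosen to make the example self-contained.

```python
# Sketch of audio-visual fusion by channel concatenation.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_feats = torch.randn(4, 256, 1999)   # (batch, channels, audio frames)
visual_feats = torch.randn(4, 64, 50)     # lip feature-point embeddings (video rate)

# Upsample the visual stream so both modalities share one temporal axis.
visual_up = F.interpolate(visual_feats, size=audio_feats.shape[-1], mode='nearest')
fused = torch.cat([audio_feats, visual_up], dim=1)    # (4, 320, 1999)

# A 1x1 conv projects the fused features back to the separator's width
# before they enter the (convolutional or dual-path recurrent) separation module.
proj = nn.Conv1d(320, 256, kernel_size=1)
print(proj(fused).shape)                              # torch.Size([4, 256, 1999])
```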
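The dual-path recurrent separation module handles long sequences by alternating modeling within and across chunks of the feature sequence. A rough single-block sketch follows; the chunking, widths, and residual connections are illustrative assumptions in the spirit of DPRNN, not the thesis's exact module.

```python
# Rough sketch of one dual-path recurrent block: an intra-chunk RNN models
# local structure, an inter-chunk RNN models long-range structure.
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.intra = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, dim)
        self.inter = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, dim)

    def forward(self, x):                # x: (batch, n_chunks, chunk_len, dim)
        b, n, k, d = x.shape
        # Intra-chunk pass: the RNN runs along each chunk independently.
        h = x.reshape(b * n, k, d)
        h = self.intra_proj(self.intra(h)[0]).reshape(b, n, k, d) + x
        # Inter-chunk pass: the RNN runs across chunks at each in-chunk position.
        g = h.transpose(1, 2).reshape(b * k, n, d)
        g = self.inter_proj(self.inter(g)[0]).reshape(b, k, n, d).transpose(1, 2)
        return g + h

block = DualPathBlock()
y = block(torch.randn(2, 40, 50, 64))    # 40 chunks of length 50
print(y.shape)                           # torch.Size([2, 40, 50, 64])
```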
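The abstract does not detail the dual-attention design, but a generic cross-modal attention step conveys how content-based alignment can side-step the "order" problem of position-based fusion: each speaker's audio features attend to lip features by similarity rather than by a fixed stacking order. Everything below (dimensions, a single attention layer) is an assumption for illustration, not the thesis's actual mechanism.

```python
# Hedged sketch of cross-modal attention fusion. Audio queries attend to
# visual keys/values, so audio-visual pairing is learned from content.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
audio = torch.randn(4, 1999, 256)     # (batch, frames, dim) audio features
visual = torch.randn(4, 1999, 256)    # lip features, already time-aligned

# Alignment is computed by similarity, avoiding a fixed visual-stream order.
fused, weights = attn(query=audio, key=visual, value=visual)
print(fused.shape)                    # torch.Size([4, 1999, 256])
```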
Keywords/Search Tags: speech separation, feature points, bimodal, attention