
Research On Low Intrusive Multi-modal Speech Separation Methods

Posted on: 2024-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: W N Li
Full Text: PDF
GTID: 2568307073962089
Subject: Control Science and Engineering
Abstract/Summary:
Current multi-modal speech separation methods achieve significantly better accuracy and stability than unimodal methods because they exploit audio and visual information at the same time. However, existing multi-modal methods rely on visual feature extraction from high-definition images, which consumes large amounts of computing resources and makes it difficult to meet the real-time requirements of robot applications. Moreover, high-definition face images are themselves privacy-sensitive, and as personal privacy grows in importance, multi-modal speech separation methods that balance separation performance with user privacy protection are gradually becoming a research hotspot. This thesis first surveys current speech separation and visual feature extraction methods and points out the problems and limitations of existing approaches. To address the privacy-protection problem of current multi-modal speech separation, it then studies low-intrusive multi-modal speech separation methods based on deep learning, covering a quantitative evaluation method for visual privacy intrusiveness, the construction of visual modality models, and multi-modal speech separation methods that protect user privacy. The specific contributions are as follows.

For visual privacy evaluation, a quantitative method for computing visual privacy intrusiveness is proposed. The method determines a resolution threshold below which an image is no longer privacy-intrusive: when the face image resolution falls below this threshold, a face feature extractor can no longer distinguish between different identities. The threshold therefore serves as a quantitative measure of visual privacy intrusiveness (a minimal sketch of such an evaluation follows the abstract).

To address the difficulty of extracting visual dynamic features from low-resolution images, a speaker visual modality model with a fast-slow dual-branch structure is constructed, which simultaneously extracts the spatial semantic features of the face and the lip motion features produced while the speaker talks. To verify the effectiveness and feasibility of the proposed method, the low-resolution visual modality model is combined with current mainstream audio-only speech separation models. The experimental results show that the proposed model maintains good separation performance even with low-resolution face images. For example, with DPRNN as the separation backbone, the method improves SI-SNRi by 1.3%, 1.8%, and 10.9% and SDRi by 1.6%, 2.0%, and 6.0% on the LRS3, LRS2, and GRID datasets, respectively, compared with the unimodal speech separation method.

To address the loss of visual information in low-resolution images and the difficulty of aligning visual and audio information, a speaker visual modality model with a fast-medium-slow three-branch structure is proposed (sketched after the abstract). The three branches extract phoneme-level, word-level, and utterance-level dynamic information from the visual modality, and multi-stage feature fusion is adopted to keep the multi-level audio-visual dynamic features consistent. The experimental results show that the model maintains good separation performance at low resolution and helps alleviate the poor fusion caused by the large mismatch between audio and visual features in multi-modal speech separation. For example, with DPRNN as the separation backbone, the network improves SI-SNRi by 4.3%, 10.6%, and 26.5% and SDRi by 4.6%, 11.3%, and 21.8% on LRS3, LRS2, and GRID, respectively, compared with the unimodal speech separation model under the same conditions.
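The abstract states the idea of the privacy-intrusiveness threshold but not its implementation. Below is a minimal sketch of how such a threshold search could look, assuming a generic pretrained face embedder, a cosine-similarity separability gap as the distinguishability criterion, and an arbitrary grid of candidate resolutions; none of these specifics are taken from the thesis.

# Minimal sketch of a resolution-threshold search for visual privacy
# intrusiveness. The embedder, the separability criterion, and the candidate
# resolutions are illustrative assumptions, not the thesis's exact protocol.
import torch
import torch.nn.functional as F


def embeddings_at_resolution(faces, embedder, res):
    """Downsample face crops to res x res, upsample back, and embed them.

    faces:    (N, 3, H, W) tensor of aligned face crops in [0, 1].
    embedder: any callable mapping (N, 3, H, W) -> (N, D) identity features
              (e.g. a pretrained ArcFace-style network) -- assumed available.
    """
    h, w = faces.shape[-2:]
    low = F.interpolate(faces, size=(res, res), mode="bilinear", align_corners=False)
    restored = F.interpolate(low, size=(h, w), mode="bilinear", align_corners=False)
    return F.normalize(embedder(restored), dim=-1)


def identity_separability(emb, labels):
    """Gap between mean same-identity and mean cross-identity cosine similarity."""
    sim = emb @ emb.t()
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    return sim[same & off_diag].mean() - sim[~same].mean()


def privacy_resolution_threshold(faces, labels, embedder,
                                 candidates=(112, 64, 32, 16, 8, 4),
                                 tol=0.05):
    """Largest candidate resolution at which the embedder can no longer
    separate identities (similarity gap below tol), i.e. the resolution
    below which the images are treated as non-intrusive."""
    for res in sorted(candidates, reverse=True):
        gap = identity_separability(
            embeddings_at_resolution(faces, embedder, res), labels)
        if gap < tol:
            return res
    return None  # even the coarsest candidate resolution still leaks identity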
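Likewise, only the fast-medium-slow multi-branch idea and its phoneme-/word-/utterance-level interpretation are given in the abstract. The sketch below shows one plausible PyTorch realization for low-resolution lip/face video; the 3D-convolution backbones, channel widths, temporal strides, and the simple concatenation-based fusion are illustrative assumptions rather than the thesis's actual design.

# Minimal sketch of a fast-medium-slow three-branch visual front-end for
# low-resolution face/lip video; all layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv3d_block(in_ch, out_ch, t_stride):
    """A small Conv3d stage; t_stride controls how aggressively time is pooled."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3),
                  stride=(t_stride, 2, 2), padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )


class ThreeBranchVisualEncoder(nn.Module):
    """Fast branch: full frame rate (phoneme-level lip dynamics).
    Medium branch: temporally downsampled (word-level dynamics).
    Slow branch: strongly downsampled (utterance-level / face semantics)."""

    def __init__(self, in_ch=1, dim=256):
        super().__init__()
        self.fast = nn.Sequential(conv3d_block(in_ch, 32, 1), conv3d_block(32, 64, 1))
        self.medium = nn.Sequential(conv3d_block(in_ch, 32, 2), conv3d_block(32, 64, 2))
        self.slow = nn.Sequential(conv3d_block(in_ch, 64, 4), conv3d_block(64, 128, 2))
        self.proj = nn.Conv1d(64 + 64 + 128, dim, kernel_size=1)

    @staticmethod
    def _pool_space(x):
        # (B, C, T, H, W) -> (B, C, T): average the spatial grid per frame
        return x.mean(dim=(-2, -1))

    def forward(self, video):
        # video: (B, C, T, H, W) low-resolution face/lip clip
        f = self._pool_space(self.fast(video))    # (B, 64, T)
        m = self._pool_space(self.medium(video))  # (B, 64, ~T/4)
        s = self._pool_space(self.slow(video))    # (B, 128, ~T/8)
        T = f.shape[-1]
        # "multi-stage fusion" here is simply aligning every branch to the
        # fast time axis and concatenating; the thesis's actual fusion is richer
        m = F.interpolate(m, size=T, mode="nearest")
        s = F.interpolate(s, size=T, mode="nearest")
        return self.proj(torch.cat([f, m, s], dim=1))  # (B, dim, T) visual embedding


# Usage: a 2-second, 25 fps, 32x32 grayscale mouth-region clip
enc = ThreeBranchVisualEncoder()
feats = enc(torch.randn(2, 1, 50, 32, 32))
print(feats.shape)  # torch.Size([2, 256, 50])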
Keywords/Search Tags: Speech separation, Multimodality, Privacy preservation, Low resolution, Deep learning