
Research On Low Intrusive Multi-modal Speech Separation Methods

Posted on: 2024-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: W N Li
Full Text: PDF
GTID: 2568307073962089
Subject: Control Science and Engineering
Abstract/Summary:
Current multi-modal speech separation methods achieve significantly better accuracy and stability than unimodal methods because they exploit audio and visual information at the same time. However, existing multi-modal methods rely on visual feature extraction from high-definition images, which consumes large amounts of computing resources and makes it difficult to meet the real-time requirements of robot applications. Moreover, high-definition face images are themselves privacy-sensitive, and as personal privacy grows in importance, multi-modal speech separation methods that balance separation performance with user privacy protection are gradually becoming a research hotspot. This thesis first surveys current speech separation and visual feature extraction methods and points out the problems and limitations of existing approaches. To address the privacy-protection problem of current multi-modal speech separation, it then studies low-intrusive multi-modal speech separation methods based on deep learning, covering a quantitative evaluation method for visual privacy intrusiveness, the construction of visual modality models, and multi-modal speech separation methods that protect user privacy. The specific contributions are as follows.

For visual privacy evaluation, a quantitative method for computing visual privacy intrusiveness is proposed. The method determines a resolution threshold below which an image is no longer privacy-intrusive: when the face image resolution falls below this threshold, a face feature extractor can no longer distinguish between different identities. The threshold therefore serves as a quantitative measure of visual privacy intrusiveness (a minimal sketch of such an evaluation follows the abstract).

To address the difficulty of extracting visual dynamic features from low-resolution images, a speaker visual modality model with a fast-slow dual-branch structure is constructed, which simultaneously extracts the spatial semantic features of the face and the lip motion features produced while the speaker talks. To verify the effectiveness and feasibility of the proposed method, the low-resolution visual modality model is combined with current mainstream audio-only speech separation models. The experimental results show that the proposed model maintains good separation performance even with low-resolution face images. For example, with DPRNN as the separation backbone, the method improves SI-SNRi by 1.3%, 1.8%, and 10.9% and SDRi by 1.6%, 2.0%, and 6.0% on the LRS3, LRS2, and GRID datasets, respectively, compared with the unimodal speech separation method.

To address the loss of visual information in low-resolution images and the difficulty of aligning visual and audio information, a speaker visual modality model with a fast-medium-slow three-branch structure is proposed (sketched after the abstract). The three branches extract phoneme-level, word-level, and utterance-level dynamic information from the visual modality, and multi-stage feature fusion is adopted to keep the multi-level audio-visual dynamic features consistent. The experimental results show that the model maintains good separation performance at low resolution and helps alleviate the poor fusion caused by the large mismatch between audio and visual features in multi-modal speech separation. For example, with DPRNN as the separation backbone, the network improves SI-SNRi by 4.3%, 10.6%, and 26.5% and SDRi by 4.6%, 11.3%, and 21.8% on LRS3, LRS2, and GRID, respectively, compared with the unimodal speech separation model under the same conditions.
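The abstract states the idea of the privacy-intrusiveness threshold but not its implementation. Below is a minimal sketch of how such a threshold search could look, assuming a generic pretrained face embedder, a cosine-similarity separability gap as the distinguishability criterion, and an arbitrary grid of candidate resolutions; none of these specifics are taken from the thesis.

# Minimal sketch of a resolution-threshold search for visual privacy
# intrusiveness. The embedder, the separability criterion, and the candidate
# resolutions are illustrative assumptions, not the thesis's exact protocol.
import torch
import torch.nn.functional as F


def embeddings_at_resolution(faces, embedder, res):
    """Downsample face crops to res x res, upsample back, and embed them.

    faces:    (N, 3, H, W) tensor of aligned face crops in [0, 1].
    embedder: any callable mapping (N, 3, H, W) -> (N, D) identity features
              (e.g. a pretrained ArcFace-style network) -- assumed available.
    """
    h, w = faces.shape[-2:]
    low = F.interpolate(faces, size=(res, res), mode="bilinear", align_corners=False)
    restored = F.interpolate(low, size=(h, w), mode="bilinear", align_corners=False)
    return F.normalize(embedder(restored), dim=-1)


def identity_separability(emb, labels):
    """Gap between mean same-identity and mean cross-identity cosine similarity."""
    sim = emb @ emb.t()
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    return sim[same & off_diag].mean() - sim[~same].mean()


def privacy_resolution_threshold(faces, labels, embedder,
                                 candidates=(112, 64, 32, 16, 8, 4),
                                 tol=0.05):
    """Largest candidate resolution at which the embedder can no longer
    separate identities (similarity gap below tol), i.e. the resolution
    below which the images are treated as non-intrusive."""
    for res in sorted(candidates, reverse=True):
        gap = identity_separability(
            embeddings_at_resolution(faces, embedder, res), labels)
        if gap < tol:
            return res
    return None  # even the coarsest candidate resolution still leaks identity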
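Likewise, only the fast-medium-slow multi-branch idea and its phoneme-/word-/utterance-level interpretation are given in the abstract. The sketch below shows one plausible PyTorch realization for low-resolution lip/face video; the 3D-convolution backbones, channel widths, temporal strides, and the simple concatenation-based fusion are illustrative assumptions rather than the thesis's actual design.

# Minimal sketch of a fast-medium-slow three-branch visual front-end for
# low-resolution face/lip video; all layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv3d_block(in_ch, out_ch, t_stride):
    """A small Conv3d stage; t_stride controls how aggressively time is pooled."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3),
                  stride=(t_stride, 2, 2), padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )


class ThreeBranchVisualEncoder(nn.Module):
    """Fast branch: full frame rate (phoneme-level lip dynamics).
    Medium branch: temporally downsampled (word-level dynamics).
    Slow branch: strongly downsampled (utterance-level / face semantics)."""

    def __init__(self, in_ch=1, dim=256):
        super().__init__()
        self.fast = nn.Sequential(conv3d_block(in_ch, 32, 1), conv3d_block(32, 64, 1))
        self.medium = nn.Sequential(conv3d_block(in_ch, 32, 2), conv3d_block(32, 64, 2))
        self.slow = nn.Sequential(conv3d_block(in_ch, 64, 4), conv3d_block(64, 128, 2))
        self.proj = nn.Conv1d(64 + 64 + 128, dim, kernel_size=1)

    @staticmethod
    def _pool_space(x):
        # (B, C, T, H, W) -> (B, C, T): average the spatial grid per frame
        return x.mean(dim=(-2, -1))

    def forward(self, video):
        # video: (B, C, T, H, W) low-resolution face/lip clip
        f = self._pool_space(self.fast(video))    # (B, 64, T)
        m = self._pool_space(self.medium(video))  # (B, 64, ~T/4)
        s = self._pool_space(self.slow(video))    # (B, 128, ~T/8)
        T = f.shape[-1]
        # "multi-stage fusion" here is simply aligning every branch to the
        # fast time axis and concatenating; the thesis's actual fusion is richer
        m = F.interpolate(m, size=T, mode="nearest")
        s = F.interpolate(s, size=T, mode="nearest")
        return self.proj(torch.cat([f, m, s], dim=1))  # (B, dim, T) visual embedding


# Usage: a 2-second, 25 fps, 32x32 grayscale mouth-region clip
enc = ThreeBranchVisualEncoder()
feats = enc(torch.randn(2, 1, 50, 32, 32))
print(feats.shape)  # torch.Size([2, 256, 50])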
Keywords/Search Tags: Speech separation, Multimodality, Privacy preservation, Low resolution, Deep learning