
Research On Single-Channel End-to-End Target Speech Extraction Models

Posted on: 2022-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: W. S. Zhang
Full Text: PDF
GTID: 2518306767463164
Subject: Automation Technology

Abstract/Summary:
Target speaker extraction technology separates the target speech from a mixture with the help of auxiliary information, providing front-end support for speech recognition, classification, and other applications. With the industrial development of artificial intelligence, intelligent voice technology has good prospects in smart homes, work meetings, voice-interactive devices, and other fields, which place strict requirements on the perceived quality and intelligibility of speech. However, the target speech is inevitably polluted by various kinds of interference, which brings new opportunities and challenges to target speaker extraction technology. Accurately and stably extracting the target speaker from the mixture is challenging because the interference is uncertain and complicated. Moreover, traditional speaker extraction models are optimized only for the fully overlapped case and fail in the varied situations of everyday conversation. Aiming at practical applications, we propose single-channel end-to-end target speaker extraction models with high performance, strong generalization ability, and strong robustness. Our research contents and contributions are as follows:

1. Considering the problem of accurately extracting the clean target speech from a mixture in which the target speaker is active, we propose a single-channel end-to-end target speech extraction model based on an improved dual-path recurrent neural network (DPRNN). We adopt an improved DPRNN as the target speaker extraction network. The design tightly integrates the input features and the speaker embedding in both the intra-block and inter-block paths, further improving the speaker adaptability of the model. The subsequent structure fully exploits the local and global information of the fused features, handling utterance-level global context and effectively improving the model's ability to extract the target speaker. Experimental results demonstrate that the proposed model can effectively and stably extract the target speaker from overlapping speech; compared with existing models, it exhibits stronger robustness and generalization ability.

2. Considering the problem of effectively extracting the ideal target speech from a mixture in which the target speaker may be absent, we propose a universal target speaker extraction model based on target-speaker voice activity detection (TS-VAD). By introducing a TS-VAD task that helps the model identify the activity of the target speaker, our model can handle the varied situations of everyday conversation and extract high-quality target speech. In addition, we adopt pre-training followed by fine-tuning as the training method and propose a multi-task loss function based on proportional weights, aiming to alleviate the degrading effect of silent samples and to optimize joint training across different conditions. Experimental results show that our model achieves high performance and adapts well to all mixing conditions, in particular the scenario in which the target speaker is absent.
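The abstract does not detail the DPRNN architecture, but as background: a dual-path model folds a long feature sequence into overlapping chunks so that the intra-chunk path and the inter-chunk path each process short sequences. A minimal NumPy sketch of this chunking step (function name and sizes are illustrative, not taken from the thesis):

```python
import numpy as np

def segment(seq, chunk_len):
    """Split a (time, feat) sequence into 50%-overlapping chunks.

    Returns an array of shape (num_chunks, chunk_len, feat).
    In a dual-path model, the intra-chunk RNN runs over axis 1
    (length chunk_len) and the inter-chunk RNN runs over axis 0
    (num_chunks), so both paths see short sequences even when
    the original utterance is very long.
    """
    hop = chunk_len // 2
    time, feat = seq.shape
    # zero-pad the tail so the last chunk is full
    pad = (-(time - chunk_len)) % hop if time > chunk_len else chunk_len - time
    seq = np.pad(seq, ((0, pad), (0, 0)))
    n_chunks = (seq.shape[0] - chunk_len) // hop + 1
    return np.stack([seq[i * hop: i * hop + chunk_len] for i in range(n_chunks)])

x = np.random.randn(100, 8)        # 100 frames, 8 features
chunks = segment(x, chunk_len=20)  # shape (9, 20, 8)
```

Overlap-add of the processed chunks reverses this folding to recover an utterance-level output.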
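The abstract does not give the form of the proportional-weight multi-task loss; one plausible sketch combines a scale-invariant SNR (SI-SNR) extraction term with a binary cross-entropy VAD term, weighted per sample so that silent samples are trained mainly through the VAD branch. The weighting scheme and names below are hypothetical, not the thesis's actual formulation:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def multitask_loss(est, ref, vad_prob, vad_label, alpha):
    """Weighted sum of extraction and VAD losses.

    alpha blends the two terms; choosing alpha in proportion to the
    target speaker's activity (hypothetical scheme) means a fully
    silent sample, where SI-SNR is ill-defined, contributes mostly
    through the VAD term instead of degrading the extraction branch.
    """
    eps = 1e-8
    extraction = -si_snr(est, ref)  # minimize negative SI-SNR
    vad = -(vad_label * np.log(vad_prob + eps)
            + (1.0 - vad_label) * np.log(1.0 - vad_prob + eps))  # binary CE
    return alpha * extraction + (1.0 - alpha) * vad
```

With alpha = 0 the loss reduces to pure VAD cross-entropy, which is one way to sidestep the silent-sample problem the abstract mentions.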
Keywords/Search Tags:Target Speaker Extraction, Dual-Path Recurrent Neural Network, Universal Speaker Extraction, Voice Activity Detection