Research On Pathological Voice Detection And Classification Method Based On Multimodal Data Fusion

Posted on: 2023-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: H F Shan
Full Text: PDF
GTID: 2530307055450954
Subject: Information and Communication Engineering
Abstract/Summary:
Pathological voice impairs everyday communication and reduces quality of life. It can also damage the vocal cords and the surrounding muscle tissue, creating hidden health risks. Voice analysis technology analyzes speech signals to evaluate voice quality objectively, and it has clinical value for the diagnosis and treatment of laryngeal diseases. At present, the voice state is determined mainly by analyzing the speech signal. However, a speech signal that captures only hoarseness lacks vocal cord vibration information, so the resulting voice features are incomplete and classification accuracy suffers.

To extract comprehensive voice features and improve the accuracy of pathological voice classification, this thesis proposes a pathological voice detection and classification method based on multimodal data fusion. The speech signal, which measures the hoarseness of the voice, and the electroglottography (EGG) signal, which measures the vibration of the vocal cords, are combined so that their features complement each other. This strengthens the representation of features related to pathological voice and yields stable, robust voice features. Given the nonlinear characteristics of the pronunciation process, the short-time Fourier transform (STFT) is used to map the time-domain signal to a spectrum that contains more voice characteristics. A Mel filter is designed to suppress noise in the spectrum, such as plosive and unvoiced sounds, and the resulting Mel spectrum serves as the network input. To improve training, a ResNet-18 model with transfer learning is used as the multimodal backbone network to extract features of the speech signal and the EGG signal from the Mel spectrum. In the pathological voice detection task, the multimodal features extracted by the backbone network are fused by concatenation, and a long short-term memory (LSTM) network is designed to learn the time-series information of the fused features. In the pathological voice classification task, the multimodal transfer module (MMTM) is introduced to fuse the features of the two modalities at convolution layers with different spatial dimensions in the backbone network, recalibrating the voice features of each convolution layer. Multimodal compact bilinear pooling (MCB) then fully integrates the output features of the two modalities, and fully connected layers perform pathological voice diagnosis and classification on the fused features.

The experiments in this thesis use the speech signal and EGG signal samples in the Saarbrucken Voice Database (SVD). In the pathological voice detection task, the average accuracy, recall, specificity and F1 score reached 95.73%, 96.73%, 95.48% and 96.10%, respectively. In the pathological voice classification task, the average accuracy, recall, specificity and F1 score reached 98.30%, 98.23%, 98.40% and 98.23%, respectively. The experimental results show that the classification accuracy of this method is significantly higher than that of other methods, and that the method can be applied effectively to the study of voice pathology.
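
For reference, the four metrics reported above can be computed from a binary confusion matrix as in the minimal sketch below. The abstract does not say how the thesis averaged its scores or how it handled multiple classes, so the sketch covers only the standard two-class definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the thesis.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard two-class metrics from confusion-matrix counts.

    tp/fn: pathological samples classified correctly/incorrectly.
    tn/fp: healthy samples classified correctly/incorrectly.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    recall = ratio(tp, tp + fn)          # sensitivity, true positive rate
    specificity = ratio(tn, tn + fp)     # true negative rate
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, for illustration only (not results from the thesis).
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall and specificity are reported separately because they describe the two error types independently: recall reflects missed pathological voices, and specificity reflects healthy voices flagged as pathological.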
Keywords/Search Tags: Pathological Voice Classification, Multimodal, Data Fusion, MMTM, EGG Signal, Speech Signal