Paralinguistic research aims to better understand and exploit the nonlinguistic information in human speech, such as the speaker's emotional state and social background. Studying paralinguistic information lets us understand and interpret speech more accurately, which in turn supports the development and application of speech technologies such as emotion recognition. Modern computing devices have made it possible for computers to process paralinguistic information automatically. To study paralinguistic information more effectively and apply it sensibly in daily life, this thesis investigates three tasks in the intelligent processing of paralinguistic information: escalation detection, mask speech recognition, and stuttering speech detection. Together these tasks span medical detection, human-computer interaction, and emotional analysis. The main contributions of this thesis are as follows:
(1) For paralinguistic information that reflects dialogue scenes, this thesis uses an escalation detection system based on a Cycle-Consistent Adversarial Network (CycleGAN) to detect escalation automatically. The long-segment training samples are first decomposed into frames, windowed, and otherwise preprocessed to obtain corresponding short-segment samples. The designed CycleGAN is then used to augment these short-segment samples. Next, the short-segment samples are fed to a pretrained model, which extracts their features. Finally, these features are passed to the designed classifier to obtain a decision for each short segment, and the short-segment decisions are aggregated into a decision for the long-segment sample. Experimental results show that, compared with models trained without sample augmentation, the proposed method has
achieved a performance improvement.
(2) For paralinguistic information that reflects the speaker's state, this thesis uses a neural network system based on multi-scale one-dimensional convolution for the mask speech recognition task. First, the segment-level training samples are decomposed into a set of low-level samples, and Low-Level Descriptor (LLD) features are extracted from them. These low-level samples are then converted into long-segment samples through the corresponding preprocessing operations and fed into a purpose-built one-dimensional convolutional deep neural network with a branch structure. Training this network yields the best long-segment model. Next, the development-set samples are decomposed into low-level samples in the same way and converted into long-segment samples with the same preprocessing. The trained network classifies each long-segment sample, and these decisions are aggregated into a classification result for each sentence segment. Compared with the traditional method that does not use a pretrained model, this approach improves performance to some extent.
(3) For paralinguistic information that reflects the speaker's habits, this thesis uses a feature representation method based on a self-supervised pretrained model for the stuttering speech detection task. A wav2vec model pretrained with self-supervised learning extracts feature representations of the audio samples. Because representations are taken from different attention layers of its Transformer network, they capture semantic information at different levels and yield more diverse features. These representations are then fed to a designed classifier. Compared with the traditional model, this method improves results to some extent and offers a new approach to feature representation for the stuttering speech detection task.
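Contribution (1) splits each long sample into short windowed segments, classifies every segment, and aggregates the segment decisions into one long-sample decision. The following dependency-free Python sketch illustrates that split-and-aggregate pattern. The helper names `split_into_segments` and `aggregate_segment_posteriors` are hypothetical. The segment length, hop size, Hamming window, and mean-posterior aggregation rule are illustrative assumptions, not the settings used in this thesis. Majority voting over segment labels would be an equally plausible aggregation rule.

```python
import math
from typing import List, Sequence


def split_into_segments(signal: Sequence[float], seg_len: int, hop: int) -> List[List[float]]:
    """Cut a long sample into overlapping fixed-length segments and apply a Hamming window.

    Assumption: trailing samples that do not fill a whole segment are dropped.
    """
    if seg_len <= 0 or hop <= 0:
        raise ValueError("seg_len and hop must be positive")
    if seg_len == 1:
        window = [1.0]  # a one-sample Hamming window is undefined; use no tapering
    else:
        window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (seg_len - 1)) for n in range(seg_len)]
    segments = []
    for start in range(0, len(signal) - seg_len + 1, hop):
        frame = signal[start:start + seg_len]
        segments.append([x * w for x, w in zip(frame, window)])
    return segments


def aggregate_segment_posteriors(segment_probs: Sequence[Sequence[float]]) -> int:
    """Average per-segment class posteriors and return the arg-max class for the long sample."""
    if not segment_probs:
        raise ValueError("need at least one segment")
    n_classes = len(segment_probs[0])
    mean = [sum(p[c] for p in segment_probs) / len(segment_probs) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)
```

For example, three segments with posteriors `[0.6, 0.4]`, `[0.3, 0.7]` and `[0.2, 0.8]` average to roughly `[0.37, 0.63]`, so the long sample is assigned class 1. Averaging posteriors, rather than counting votes, lets confident segments weigh more heavily than uncertain ones.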
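The multi-scale branch structure in contribution (2) can be read as several parallel one-dimensional convolutions with different kernel widths. Each branch is applied to the same LLD sequence, and the branch outputs are pooled and concatenated. The dependency-free sketch below shows one forward pass of such a block on a single-channel sequence. The function names, kernel sizes, ReLU, and global max pooling are illustrative assumptions. The sketch also leaves out multiple channels, learned weights, and training. A practical model would be built and trained in a deep learning framework.

```python
from typing import List, Sequence


def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D cross-correlation with zero padding so the output length equals the input length."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("use an odd kernel length so padding is symmetric")
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k)) for i in range(len(x))]


def multiscale_block(x: Sequence[float], kernels: Sequence[Sequence[float]]) -> List[float]:
    """Run one branch per kernel width, apply ReLU, global-max-pool each branch, and concatenate."""
    features = []
    for kernel in kernels:
        activated = [max(0.0, v) for v in conv1d_same(x, kernel)]  # ReLU
        features.append(max(activated))  # global max pooling over time
    return features
```

Short kernels respond to brief events in the descriptor sequence, and wider kernels summarize longer stretches. Concatenating the branches therefore gives the classifier evidence at several time scales at once.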
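Contribution (3) draws feature representations from different Transformer layers of a self-supervised wav2vec model. The sketch below assumes the per-layer hidden states are already available as lists of frame vectors. For example, the Hugging Face transformers `Wav2Vec2Model` returns them when called with `output_hidden_states=True`. The sketch mean-pools each selected layer over time and concatenates the results into one utterance-level vector for a downstream classifier. The helper names, layer indices, mean pooling, and concatenation are illustrative assumptions and are not necessarily the choices made in this thesis.

```python
from typing import List, Sequence

Frame = Sequence[float]


def mean_pool(frames: Sequence[Frame]) -> List[float]:
    """Average a layer's frame vectors over time, giving one vector per layer."""
    if not frames:
        raise ValueError("layer has no frames")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def layerwise_representation(hidden_states: Sequence[Sequence[Frame]], layers: Sequence[int]) -> List[float]:
    """Concatenate the time-pooled vectors of the selected layers into one feature vector."""
    if not layers:
        raise ValueError("select at least one layer")
    features: List[float] = []
    for idx in layers:
        features.extend(mean_pool(hidden_states[idx]))
    return features
```

Selecting both a lower and a higher layer combines more acoustic, low-level cues with more abstract ones. This is one simple way to obtain the more diverse representation the contribution describes.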