Speech is produced by the coordinated movement of multiple organs across the vocal system. Even a minor lesion in any of these organs can cause abnormalities in the volume, pitch, resonance, and articulation clarity of speech, resulting in pathological speech. In recent years, owing to factors such as irregular lifestyles and an aging society, the incidence of pathological speech has been gradually increasing, and a variety of pathological speech recognition technologies and applications have emerged in response. However, existing research relies almost entirely on existing databases. Because collection is difficult and patient privacy must be protected, typical pathological speech databases contain far fewer samples than normal speech databases, which makes pathological speech classification and recognition difficult and limits the results achieved. Data augmentation improves model performance by generating additional training data and has been applied widely in speech and image tasks. Using data augmentation to alleviate the shortage of data in pathological speech databases is therefore an important way to further improve pathological speech recognition.

Traditional pathological speech data augmentation methods are based on fixed rules, so the augmented speech tends to cluster in high-dimensional space and the generated speech lacks diversity. In contrast, augmentation methods based on Generative Adversarial Networks (GAN) can sample the required data directly from a random signal and generate more diverse speech. To address the shortage of data in pathological speech classification and recognition, this thesis focuses on GAN-based augmentation of pathological speech data, proposes a Dilated Convolutional Generative Adversarial Network with frequency loss (DFGAN), and constructs a pathological speech recognition system under data augmentation based on the proposed network.

The proposed DFGAN addresses shortcomings in both the network structure and the loss function of existing models. First, because existing model structures have difficulty capturing the multi-scale features of speech signals, DFGAN uses dilated convolutions with multiple dilation rates to capture pathological speech features and introduces auxiliary features to guide the generation process, which allows pathological speech signals to be modeled directly. Second, because the loss functions of existing GANs do not account for the shift of pathological speech energy toward higher frequencies, DFGAN includes an adaptive frequency domain energy function that captures the energy in different frequency bands. This function determines the weight of each frequency band when the loss between the generated speech and the original speech is computed. The joint optimization of the adversarial loss and the proposed adaptive frequency domain loss is also given.

The pathological speech recognition system under data augmentation consists of a data augmentation module based on the proposed model and a back-end recognition module. The data augmentation module trains the DFGAN model on the training set and uses it to generate augmented data. The back-end recognition module trains the pathological speech classifiers and the end-to-end speech recognition model on the training set together with the augmented set. The result on the test set is taken as the final recognition result.

To verify the improvement that DFGAN brings to pathological speech classification and recognition, experiments were carried out on four commonly used pathological speech databases. In the classification experiments distinguishing normal speech from pathological speech, the proposed method improves accuracy by an average of 4.16%, and the improvement is larger for speech whose original accuracy was lower. In the pathological speech-to-text recognition experiments based on the end-to-end model, the proposed method reduces the word error rate by 2% to 6% overall, and the improvement is larger for pathological speech with low intelligibility. A comparative analysis with existing data augmentation methods confirms the contribution of the proposed method to pathological speech recognition when only a small amount of data is available.
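
The abstract does not state the form of the adaptive frequency domain loss, so the following is only a minimal sketch of the general idea: measure how energy is distributed across frequency bands, use that distribution to weight a per-band spectral error, and add the result to an adversarial loss. Everything concrete here is an assumption made for illustration, including the equal-width band split, the use of the original signal's energy share as the band weight, the squared-magnitude error, and the names `band_weights`, `adaptive_frequency_loss`, `joint_generator_loss`, `n_bands`, and `lam`. It should not be read as the DFGAN formulation itself.

```python
import numpy as np

def band_weights(signal, n_bands=4, eps=1e-12):
    """Share of spectral energy in each equal-width band (assumed weighting)."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    energy = np.array([band.sum() for band in np.array_split(power, n_bands)])
    return energy / (energy.sum() + eps)

def adaptive_frequency_loss(real, fake, n_bands=4):
    """Band-weighted mean squared error between magnitude spectra.

    Bands that hold more of the original signal's energy contribute
    more to the loss (an illustrative choice, not the thesis's formula).
    """
    mag_real = np.abs(np.fft.rfft(real))
    mag_fake = np.abs(np.fft.rfft(fake))
    sq_err = (mag_real - mag_fake) ** 2
    per_band = np.array([b.mean() for b in np.array_split(sq_err, n_bands)])
    return float(np.sum(band_weights(real, n_bands) * per_band))

def joint_generator_loss(adv_loss, real, fake, lam=1.0, n_bands=4):
    """Adversarial loss plus a weighted frequency term (lam is hypothetical)."""
    return adv_loss + lam * adaptive_frequency_loss(real, fake, n_bands)

if __name__ == "__main__":
    t = np.arange(1024) / 1024.0
    original = np.sin(2 * np.pi * 50 * t)    # energy in the lowest band
    generated = np.sin(2 * np.pi * 400 * t)  # energy in the highest band
    print(band_weights(original))
    print(joint_generator_loss(0.5, original, generated, lam=0.01))
```

In an actual GAN training loop the same computation would be written with a differentiable framework so that gradients flow back to the generator; NumPy is used here only to keep the sketch self-contained.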