Speaker recognition is a pattern recognition technology that performs identity authentication using the physiological and individual behavioral characteristics contained in speech signals. It is widely applied and has broad development prospects in information authentication, criminal investigation, financial security, and other fields, so research on speaker recognition technology has great practical significance. However, most existing speaker recognition systems use only single-modal acoustic features or kinematic features together with traditional classifiers such as Gaussian mixture models and support vector machines, which generalize poorly and make it difficult to improve recognition accuracy in practice. For these reasons, this paper starts from kinematic features, fuses them with acoustic features, and improves the deep neural network to enhance the speaker recognition system. The main research work is as follows:

(1) To address the problems that traditional kinematic features provide insufficient articulatory parameters and yield low recognition rates, a reference-point articulatory movement feature extraction algorithm is proposed. The movement parameters of the tongue, lips, and mandible relative to the bridge of the nose are extracted to obtain the reference-point articulatory movement features. At the same time, pronunciation action cepstral coefficients are extracted from the motion trajectories based on the Bark domain and the Mel-cepstral feature extraction process, further capturing low-frequency information. A speaker feature difference analysis is carried out on the acoustic features, the reference-point articulatory movement features, and the pronunciation action cepstral coefficients, verifying the validity of the two feature types proposed in this paper.
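To make the first contribution concrete, the following Python sketch illustrates one plausible way to compute reference-point articulatory movement features and trajectory-based cepstral coefficients. It assumes EMA-style sensor trajectories stored as NumPy arrays; the sensor layout, the low-frequency filter-bank design, and the function names (reference_point_features, trajectory_cepstral_coeffs) are illustrative assumptions, not the exact procedure of this paper.

```python
import numpy as np
from scipy.fftpack import dct

def reference_point_features(sensors, nose_bridge):
    """Articulatory trajectories expressed relative to a nose-bridge reference.

    sensors     : dict of name -> (T, 3) position arrays (e.g. tongue, lip, jaw)
    nose_bridge : (T, 3) array of the reference sensor (assumed available)
    Returns a (T, 3 * len(sensors)) matrix of relative displacements.
    """
    rel = [traj - nose_bridge for traj in sensors.values()]  # cancel head motion
    return np.hstack(rel)

def trajectory_cepstral_coeffs(rel_traj, n_filters=24, n_ceps=12):
    """Cepstral-style coefficients computed from a motion trajectory.

    Loosely follows a Mel/Bark-like chain: magnitude spectrum -> triangular
    filter bank emphasising low frequencies -> log -> DCT. The filter design
    here is a simple placeholder, not the thesis's exact Bark-domain bank.
    """
    spec = np.abs(np.fft.rfft(rel_traj, axis=0))            # (F, D) magnitude spectrum
    n_bins = spec.shape[0]
    # Filter centers warped toward low frequencies (illustrative approximation).
    centers = np.linspace(0, np.sqrt(n_bins - 1), n_filters + 2) ** 2
    bank = np.zeros((n_filters, n_bins))
    f = np.arange(n_bins)
    for i in range(n_filters):
        lo, mid, hi = centers[i], centers[i + 1], centers[i + 2]
        bank[i] = np.clip(np.minimum((f - lo) / (mid - lo + 1e-8),
                                     (hi - f) / (hi - mid + 1e-8)), 0, None)
    energies = np.log(bank @ spec + 1e-8)                   # (n_filters, D)
    return dct(energies, axis=0, norm="ortho")[:n_ceps].ravel()
```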
(2) To address the difficulty of improving speaker recognition accuracy using only single-modal acoustic or kinematic features, a speaker recognition method that combines acoustic and kinematic features is proposed. First, acoustic features such as prosodic features and gammatone filter cepstral coefficients are extracted and their statistical characteristics are computed. Second, the reference-point articulatory movement features and pronunciation action cepstral coefficients are extracted. Finally, the acoustic statistical features are fused with the improved articulatory movement features, and embedded feature selection is applied to remove redundant features and obtain the dual-modal fusion features. Gaussian mixture model, support vector machine, and deep neural network classifiers are then used to classify speakers.

(3) To make full use of the information contained in speaker features and further improve the performance of the speaker recognition system, this paper proposes a deep neural network speaker recognition algorithm that combines Gaussian statistical features with a mini-batch gradient descent algorithm. First, Gaussian statistical features are extracted from the original speech features using a maximum a posteriori algorithm. Then, the mini-batch gradient descent optimization algorithm is used to reduce model complexity and system training time. Finally, a deep neural network model based on mini-batch gradient descent is constructed, and the original feature space is transformed into a linearly separable, speaker-related space through the nonlinear mapping of the deep neural network.

In this paper, data sets of healthy speakers and speakers with dysarthria are drawn from the TORGO database and a self-built database for speaker recognition experiments. The experimental results show that the reference-point articulatory movement features and pronunciation action cepstral coefficients outperform the traditional articulatory movement features, and that the recognition rate of the dual-modal fusion features is significantly higher than that of single-modal features. The TORGO database and the self-built database are then used as the overall experimental data for speaker recognition, and the results show that the improved deep neural network classifier achieves better recognition performance than the other classifiers. Combining the dual-modal fusion features with the improved deep neural network classifier yields a recognition system that significantly improves speaker recognition accuracy. The method proposed in this paper has important theoretical significance and reference value for research on speaker recognition technology.
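To illustrate the mini-batch gradient descent training described in contribution (3), the sketch below trains a small fully connected softmax classifier on fused speaker features with NumPy. The architecture, hyperparameters, and function name (train_dnn_minibatch) are placeholders, and the MAP-based Gaussian statistical feature extraction is omitted; this is a minimal sketch, not the network used in the paper.

```python
import numpy as np

def train_dnn_minibatch(X, y, n_classes, hidden=256, lr=0.01, epochs=20, batch=64, seed=0):
    """Mini-batch gradient descent for a one-hidden-layer softmax classifier.

    X : (N, D) fused speaker feature matrix, y : (N,) integer speaker labels.
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    W1 = rng.normal(0, 0.01, (D, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.01, (hidden, n_classes)); b2 = np.zeros(n_classes)
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            idx = order[start:start + batch]
            xb, yb = X[idx], y[idx]
            # Forward pass: ReLU hidden layer, softmax output.
            h = np.maximum(xb @ W1 + b1, 0)
            logits = h @ W2 + b2
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            # Backward pass for cross-entropy loss, averaged over the batch.
            grad_logits = p
            grad_logits[np.arange(len(idx)), yb] -= 1
            grad_logits /= len(idx)
            gW2 = h.T @ grad_logits; gb2 = grad_logits.sum(0)
            gh = grad_logits @ W2.T
            gh[h <= 0] = 0
            gW1 = xb.T @ gh; gb1 = gh.sum(0)
            # Parameter update on the current mini-batch only.
            W1 -= lr * gW1; b1 -= lr * gb1
            W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2
```

Updating on small batches rather than the full training set trades a little gradient noise per step for far cheaper updates, which is the usual reason mini-batch descent shortens training time relative to full-batch optimization.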