Speech emotion recognition has important applications in improving user experience, human-computer interaction, fraud detection, and other areas. Existing spectrogram-based speech emotion recognition methods mainly seek features common to a class of emotional speech, but they cannot capture differences in speech rhythm between individuals, which limits recognition performance. Extracting the individual rhythm differences of emotional speech is therefore the key to improving emotion recognition. This paper studies the above problem in depth; the main work includes the following three points.

First, even within the same emotion category, differences in expressive rhythm produce differences in the energy distribution of speech, so extracting features along a single unified dimension introduces bias. Based on these rhythm differences, this paper proposes a speech emotion recognition method built on time-frequency fusion of energy frames. The key idea is to select the spectra of the high-energy regions of the speech and to represent individual rhythm differences through the distribution of high-energy frames and their time-frequency variation. On this basis, an emotion recognition model combining a convolutional neural network (CNN) and a recurrent neural network (RNN) is established to extract and fuse the temporal and spectral variation of the spectrum. Experiments on the public IEMOCAP dataset show that recognition based on speech rhythm differences improves weighted accuracy (WA) by 1.05% and unweighted accuracy (UA) by 1.9% over the spectrogram-based method.

Second, the question is how to pair the strong-beat frames found by energy with suitable weak-beat frames, so that the resulting strong-weak rhythm combination better reflects individual rhythm differences. To this end, a self-attention time-frequency fusion network based on spectral information entropy is proposed. The core idea is to first screen out noise segments and select strong-energy frames via voice activity detection using short-term energy and the short-term zero-crossing rate; then, for the remaining speech frames, compute the spectral probability density of each frequency component and obtain each frame's spectral information entropy from the information-entropy formula. Frames with high information entropy are selected as weak beats, and the strong and weak beats are fed into the CNN+RNN network in chronological order for time-frequency feature fusion. A self-attention mechanism is added to the RNN to reduce forgetting and to model the relationships among frames. Experiments on IEMOCAP show that, relative to the first method, WA improves by a further 0.5% and UA by 0.6%; compared with the spectrogram-based method, WA and UA improve by 1.55% and 2.5% on average.

Finally, a financial audit subsystem based on speech emotion analysis is designed and implemented. After a video question-and-answer dialogue between the auditor and the user, the audio stream is extracted for speech emotion recognition: the strong-beat and weak-beat frames of the audio are selected, the time-ordered frame sequence is reconstructed, and the sequence is fed into the above model for feature extraction and emotion recognition. The system visualizes the user's acoustic statistics, spectrograms, spectra, and emotional changes; it not only recognizes speech emotion but also helps anti-fraud risk-control personnel analyze the emotions of loan applicants.

In summary, individuals differ in speaking rate, rhythm, and other aspects of emotional expression, and existing models do not reflect these inter-individual rhythm differences well; the challenge is to represent them accurately. To address this, from the perspectives of energy frames and information entropy, this paper proposes a frequency-domain spectral-line selection method based on energy frames and spectral information entropy together with a time-frequency fusion network model based on the self-attention mechanism, and designs and implements a speech emotion recognition subsystem. Experiments and system verification show that the proposed methods and models have theoretical value for representing individual differences in speech emotion recognition, and practical value for speech datasets with large individual differences and pronounced rhythm variation.
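The frame screening described above (short-term energy and short-term zero-crossing rate, followed by selection of high-energy frames) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the frame length (25 ms), hop (10 ms at 16 kHz), and the relative energy threshold are assumptions chosen for demonstration.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (16 kHz: 25 ms / 10 ms)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def short_term_energy(frames):
    """Short-term energy of each frame: sum of squared samples."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes within each frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def select_high_energy_frames(x, energy_ratio=0.5):
    """Keep frames whose energy exceeds a fraction of the maximum frame
    energy (an illustrative threshold; the thesis does not fix a value)."""
    frames = frame_signal(x)
    e = short_term_energy(frames)
    keep = e >= energy_ratio * e.max()
    return frames[keep], keep
```

In a full voice-activity-detection front end, the zero-crossing rate would additionally help reject low-energy unvoiced noise; here it is shown only as the second screening feature named in the text.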
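The per-frame spectral information entropy used to pick weak-beat frames can be computed by normalizing the frame's power spectrum into a probability density over frequency components and applying the information-entropy formula. A sketch under assumed parameters (FFT size 512; the log base and normalization are illustrative):

```python
import numpy as np

def spectral_entropy(frame, n_fft=512, eps=1e-12):
    """Spectral information entropy of one frame.

    The power spectrum is normalized to a probability density p_k over
    frequency components, and H = -sum_k p_k * log2(p_k)."""
    spec = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2
    p = spec / (spec.sum() + eps)
    return float(-np.sum(p * np.log2(p + eps)))
```

Intuitively, a pure tone concentrates its energy in a few bins and yields entropy near zero, while broadband noise spreads energy across all bins and yields entropy near log2 of the number of bins; speech frames fall in between, which is what makes the measure usable for ranking frames.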
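The self-attention added on top of the RNN can be sketched as standard single-head scaled dot-product attention over the sequence of frame features. The dimensions and the single-head form are assumptions for illustration; the abstract does not specify the exact attention configuration used in the model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    h:          (T, d) sequence of frame features (e.g. RNN hidden states)
    wq, wk, wv: (d, d) learned projection matrices
    Returns a (T, d) sequence in which each output frame mixes information
    from every other frame, weighted by query-key similarity."""
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(h.shape[1])   # (T, T) pairwise similarities
    return softmax(scores, axis=-1) @ v
```

Because every output position attends to every frame in the sequence, long-range dependencies between strong and weak beats are preserved directly rather than only through the RNN's recurrent state, which is the "reduce forgetting" effect described above.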