Early acoustic research did not combine the detection and localization of sound signals, but instead pursued two separate directions: sound event detection and sound source localization. In recent years, the two have been combined to simultaneously recognize and locate sound events, which involves detecting whether a sound event occurs in an audio signal, identifying the event category, and estimating the azimuth and elevation angles of each event's direction of arrival. The SELD baseline system for the DCASE 2020 sound event localization and detection task achieves these functions, but issues remain, such as poor recognition accuracy and localization accuracy that needs improvement. This paper proposes the following improvements to the baseline system:

(1) First, to address the insufficient feature extraction of the network model in the DCASE 2020 SELD baseline system, the paper proposes an improved SR-BiGRU model. Compared with the baseline system, the CNN layer is replaced with an improved convolutional block consisting of three convolutional sub-blocks connected in a residual-like structure, which increases the network depth while mitigating vanishing and exploding gradients. The improved convolutional block also includes a squeeze-and-excitation residual convolution module that strengthens the network's ability to extract features across the channel and spatial dimensions of the data. Simulations of the improved model yield an error rate ER≤20° of 0.49, an F-score F≤20° of 61.7, a class-dependent localization error LE_CD of 18.1°, and a class-dependent localization recall LR_CD of 67.7, a significant improvement over the baseline system's 0.72, 37.7, 23.5, and 62.0 on the corresponding metrics.

(2) Second, to address the model's insufficient ability to extract temporal features, the paper proposes the SR-TCN network model, an improvement on the SR-BiGRU model. The SR-TCN model replaces the bidirectional gated recurrent unit network used for temporal analysis and detection with a bidirectional temporal convolutional network, improving the model's ability to extract temporally continuous features from the data. The paper also introduces new fused features and data augmentation to improve the model's robustness. With FOA-format data, the model achieves ER≤20°, F≤20°, LE_CD, and LR_CD of 0.45, 65.2, 16.8°, and 73.2, respectively; with MIC-format data, it achieves 0.48, 62.1, 17.9°, and 71.3, respectively, all better than the baseline system's results.

This paper mainly improves the baseline system for sound event recognition and localization, raising the model's accuracy at a small time cost. In the future, sound event recognition and localization can be applied in many fields, such as helping hearing-impaired individuals identify sound categories and sources, enhancing microphone directionality in teleconferences, and helping robots navigate and interact with their surroundings.
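The squeeze-and-excitation idea mentioned above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the channel count, reduction ratio, and weight values below are assumed purely for illustration. The "squeeze" step pools each channel to a single scalar, and the "excite" step passes those scalars through a small bottleneck network whose sigmoid output reweights the channels:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excitation over a (channels, time, freq) feature map.

    Squeeze: global average pool per channel -> (C,)
    Excite:  two FC layers with a bottleneck, sigmoid gate -> (C,)
    Scale:   reweight each channel of the input feature map.
    """
    z = x.mean(axis=(1, 2))            # squeeze: one scalar per channel
    s = np.maximum(z @ w1, 0.0)        # FC + ReLU bottleneck (C -> C/r)
    g = sigmoid(s @ w2)                # FC + sigmoid gate, back to (C,)
    return x * g[:, None, None]        # scale each channel by its gate

# Toy example: 8 channels, reduction ratio r = 4 (assumed values)
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 10, 12))   # (channels, time, freq)
w1 = rng.standard_normal((8, 2)) * 0.1
w2 = rng.standard_normal((2, 8)) * 0.1
y = squeeze_excite(x, w1, w2)
print(y.shape)  # (8, 10, 12): same shape, channels rescaled
```

Because the gate is a sigmoid, each channel is scaled by a factor in (0, 1), so informative channels are emphasized relative to the rest without changing the feature map's shape.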
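The core operation behind a temporal convolutional network like the one SR-TCN uses is a causal dilated 1-D convolution: each output step sees only past inputs, and the dilation widens the receptive field as layers stack. The following NumPy sketch shows that single operation on one sequence; it is a didactic illustration under assumed kernel values, not the SR-TCN architecture itself:

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution, the building block of a TCN layer.

    x : (T,) input sequence
    w : (K,) kernel taps
    The output at step t depends only on x[t], x[t-d], x[t-2d], ...
    (causality), and dilation d spaces the taps so that stacking layers
    with growing d widens the receptive field exponentially.
    """
    T, K = len(x), len(w)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad to keep output causal
    y = np.zeros(T)
    for t in range(T):
        for k in range(K):
            y[t] += w[k] * xp[pad + t - k * dilation]
    return y

# Sanity check: the identity kernel [1, 0] returns the input unchanged
x = np.arange(6, dtype=float)
print(causal_dilated_conv1d(x, np.array([1.0, 0.0]), dilation=2))
```

With a kernel of [0, 1] and dilation 2 the same routine delays the sequence by two steps, which makes the causal padding visible: no output sample ever depends on a future input.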