Sound classification is a fundamental research task in multimedia information processing and the core technology of sound data structuring. It is significant to the fields of signal processing and speech recognition, and many fields have an urgent need for high-performance sound classification systems. In recent years, with the development of deep learning, the combination of deep neural networks with audio data processing and analysis has become a new research hotspot. Convolutional neural networks, among the most representative deep learning models, have achieved remarkable results on sound classification tasks. This paper focuses on sound classification methods based on convolutional neural network models.

First, this paper proposes a multi-scale time-domain convolutional network (WaveMsNet) with a feature fusion mechanism to address the difficulty of extracting strongly discriminative features from audio data. We analyze the dilemma convolutional neural networks face when extracting features from waveform signals: the convolution filters cannot cover the full frequency band while simultaneously improving frequency resolution, so the features extracted by the network cannot represent the audio information effectively. To this end, we propose a multi-scale time-domain convolution operation to increase the discriminability of the extracted features. We also propose a feature fusion method that combines the waveform features extracted by the network with two-dimensional time-frequency features in the same network. On the sound classification datasets ESC-10 and ESC-50, the multi-scale time-domain convolution operation improves classification accuracy by 1.95% and 2.82% on average, respectively. With the feature fusion method, the classification performance of our system exceeds previous related work.

Second, to address the poor generalization of acoustic classification models when labeled data is insufficient, we propose a hybrid-sample learning method for audio data. In neural network training, data augmentation is widely used to reduce the performance gap between the training set and the test set: it generates varied data while keeping the semantic information unchanged. Although such deformation enriches feature patterns and improves the generalization performance of the network, it treats each sample independently and does not consider variation between samples, so the relationships between different samples are ignored. In this paper, we consider whether a feature pattern can be constructed from a sample pair (two samples), so that the network learns the relationships and differences between pairs of samples of the same or different classes. We propose a hybrid-sample-based learning algorithm for convolutional neural networks that can be applied to various convolutional neural network architectures. To find a better way of mixing samples, we propose several sample-mixing methods for two kinds of audio features, time-frequency features and waveform features, and compare their performance across different network architectures. On the DCASE 2018 Task 2 dataset, our proposed Overlay method yields maximum performance gains of 3.68% and 3.27% for time-frequency and waveform features, respectively.
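
To make the multi-scale time-domain convolution described above concrete, the following is a minimal sketch in PyTorch. The class name MultiScaleTimeConv, the kernel sizes, and the channel counts are illustrative assumptions, not the exact WaveMsNet configuration; the sketch only shows the core idea that parallel 1-D convolutions with different receptive fields cover different parts of the frequency band, and their outputs are concatenated into one feature map.

```python
import torch
import torch.nn as nn

class MultiScaleTimeConv(nn.Module):
    """Parallel 1-D convolutions over a raw waveform at several kernel sizes.

    Short kernels resolve fine temporal (high-frequency) detail, long kernels
    capture low-frequency content; concatenating the branches spreads the
    learned filter bank across the band. Hyperparameters are illustrative.
    """

    def __init__(self, in_channels=1, branch_channels=32,
                 kernel_sizes=(11, 51, 101)):
        super().__init__()
        # One branch per scale; odd kernels with padding=k//2 keep the
        # strided output lengths of all branches aligned.
        self.branches = nn.ModuleList([
            nn.Conv1d(in_channels, branch_channels, k, stride=4, padding=k // 2)
            for k in kernel_sizes
        ])
        self.bn = nn.BatchNorm1d(branch_channels * len(kernel_sizes))

    def forward(self, wave):  # wave: (batch, 1, samples)
        feats = [torch.relu(branch(wave)) for branch in self.branches]
        # Trim to the shortest branch to guard against rounding differences.
        t = min(f.shape[-1] for f in feats)
        feats = [f[..., :t] for f in feats]
        return self.bn(torch.cat(feats, dim=1))  # (batch, 3 * C, t)

# Example: a batch of eight 1-second clips at 16 kHz.
x = torch.randn(8, 1, 16000)
y = MultiScaleTimeConv()(x)  # -> (8, 96, ~4000)
```

In the full network described above, a feature map like this would be fused with two-dimensional time-frequency features (e.g., log-mel spectrograms) in later layers; that fusion stage is omitted here.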
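
The abstract does not specify how the Overlay method combines a sample pair, so the sketch below uses a generic mixup-style linear interpolation as a stand-in to illustrate hybrid-sample learning; the function name overlay_mix and the Beta-distributed mixing weight are assumptions, and the thesis's actual Overlay operation may combine waveforms or spectrograms differently.

```python
import torch

def overlay_mix(x1, y1, x2, y2, alpha=0.2):
    """Build one hybrid training sample from a pair of samples.

    x1, x2: two inputs of the same shape (waveforms or spectrograms).
    y1, y2: their one-hot label vectors.
    A mixing weight lam ~ Beta(alpha, alpha) blends both inputs and labels,
    so the network sees a pattern carrying information from both classes.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x = lam * x1 + (1.0 - lam) * x2  # hybrid input
    y = lam * y1 + (1.0 - lam) * y2  # soft hybrid target
    return x, y
```

In a training loop, the usual (input, target) batch would simply be replaced by the mixed pair returned here, together with a loss that accepts soft targets (e.g., cross-entropy over the blended label vector). Because the mixing happens on the data, the same procedure applies unchanged to any convolutional network architecture, matching the architecture-agnostic claim above.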