
Research on Violent Shot Detection Based on Audio and Video Feature Fusion

Posted on: 2020-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: C Z Shao
GTID: 2428330590973231
Subject: Computer technology
Abstract/Summary:
Violent shot detection is an important task in multimedia video analysis, with high research value and practical significance. The number of multimedia videos grows daily, which raises the speed requirements for violent shot detection. Moreover, violent shots cover many semantic types, including fighting, screaming, and explosions, which makes the detection task more challenging. Most current studies address only one type of violence, so the range of detected types is narrow and the accuracy is low. A fast detection technique for violent shots spanning multiple semantic types is therefore urgently needed.

First, this thesis builds on the principle that violence generally occurs with the shot as its most basic unit. The video is segmented into shots, and each shot is then judged as violent or not. Shot segmentation of video sequences is one of the key technologies in video information processing, especially video retrieval. Traditional shot segmentation methods have low detection rates for both gradual and abrupt shot transitions, especially within a single scene. To address this problem, the thesis proposes a video segmentation method based on the visual cognition mechanism. The method uses a block-granularity color histogram to emphasize visually salient areas, and a highlight measure to describe the difference between the preceding and following frames. This greatly improves the accuracy of detecting shot changes within a single scene. In addition, based on the perception of brightness in video, the differences between multiple adjacent frames in a sliding window are used to capture the brightness change of gradual transitions. Compared with traditional methods, the proposed algorithm segments better and achieves higher precision and recall.

Second, the thesis analyzes the violence of a single shot through the visual channel, the auditory channel, and the combined audio-visual channels. For the visual channel, it compares the dense trajectory feature method from video behavior analysis with the widely used deep learning approach. In the deep learning approach, the inter-frame difference image of two adjacent frames is the input to a CNN (Convolutional Neural Network), and the features the CNN learns from each difference image are fed into an LSTM (Long Short-Term Memory) network to model the temporal signal. The thesis also modifies the LSTM structure with convolution operations, and the resulting ConvLSTM network extracts higher-level spatial features. For the audio channel, given the current scarcity of violent audio datasets, the thesis constructs a VioAudio dataset from the MediaEval film data, and then compares the traditional acoustic feature method with deep learning methods that take the raw audio waveform and the audio spectrum as network input.

Finally, a fusion experiment is carried out on the deep learning models that performed best in the visual and auditory channels. The inter-frame difference image of adjacent video frames and the corresponding audio waveform are sent to two separate CNNs for feature extraction, and the combined features are then fed into an LSTM network for temporal modeling and classification. The experiments show the effectiveness of the proposed method.

The work provides an effective solution for shot segmentation and violent shot detection in multimedia video. Experiments on multiple datasets show that the proposed method is feasible and practical. The audio-video fusion scheme also offers new ideas and directions for multimodal information fusion.
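As a rough illustration of the block-histogram idea the abstract describes for shot segmentation, the following Python sketch compares per-block color histograms of adjacent frames and flags an abrupt shot change when the weighted distance exceeds a threshold. It is not the thesis's algorithm: the function names, the 4x4 grid, the 8 bins, the L1 distance, the 0.5 threshold, and the optional `weights` argument (a hypothetical stand-in for saliency weighting) are all assumptions made for illustration, and the highlight measure and sliding-window handling of gradual transitions are not modeled.

```python
import numpy as np

def block_histograms(frame, grid=4, bins=8):
    """Return one normalized color histogram per block of an HxWxC uint8 frame."""
    h, w, c = frame.shape
    bh, bw = h // grid, w // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = frame[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            # Concatenate the per-channel histograms of this block.
            hist = np.concatenate([
                np.histogram(block[..., k], bins=bins, range=(0, 256))[0]
                for k in range(c)
            ]).astype(np.float64)
            feats.append(hist / max(hist.sum(), 1.0))
    return np.stack(feats)  # shape: (grid * grid, bins * c)

def frame_distance(a, b, weights=None, grid=4, bins=8):
    """Weighted mean of per-block histogram distances, in [0, 1]."""
    ha = block_histograms(a, grid, bins)
    hb = block_histograms(b, grid, bins)
    per_block = 0.5 * np.abs(ha - hb).sum(axis=1)
    if weights is None:
        # Assumption: uniform weights; a saliency map could supply others.
        weights = np.full(grid * grid, 1.0 / (grid * grid))
    return float(np.dot(weights, per_block))

def detect_cuts(frames, threshold=0.5, **kwargs):
    """Indices t where frames[t] differs sharply from frames[t - 1]."""
    return [
        t for t in range(1, len(frames))
        if frame_distance(frames[t - 1], frames[t], **kwargs) > threshold
    ]

# Synthetic example: three black frames followed by three white frames.
black = np.zeros((32, 32, 3), dtype=np.uint8)
white = np.full((32, 32, 3), 255, dtype=np.uint8)
print(detect_cuts([black] * 3 + [white] * 3))  # → [3]
```

Splitting the frame into blocks keeps some spatial layout that a single global histogram would discard, which is one plausible reason a block-level histogram helps within a single scene where global color statistics barely change.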
Keywords/Search Tags: shot segmentation, audio and video feature fusion, convolutional neural network, long short-term memory network, violence detection