Video is widely used in all aspects of our lives because of the richness, intuitiveness, and vividness of its content. However, with the rapid development of the Internet, the scale of video data has increased dramatically, and analyzing and managing such massive amounts of video manually requires enormous manpower. The Multimedia Event Detection (MED) task has therefore emerged in recent years and become a hot research topic in the fields of computer vision and video retrieval. Deep learning continues to make major breakthroughs in the image domain, providing an effective reference for other areas; however, no mature network structure yet exists for complex video tasks such as MED. In this paper, multimedia event detection based on multi-modal features is explored in detail. Considering the advantages and disadvantages of the existing frameworks, namely the semantic-based and average-frame-based methods, the main work of this paper is as follows:

1. First, combining the strengths of deep learning with traditional feature aggregation methods, CNN features and VLAD encoding are applied to video event detection and achieve good results.

2. Second, given the hierarchical, structural, and complex nature of video multimedia, audio features are extracted experimentally for the multimedia event detection task and combined with visual features as a complement. To address the shortage of training samples for multimedia event detection, an effective feature extraction framework is built.

3. Finally, a multimedia event detection system based on multi-modal features is built and tested on multiple data sets. The system participated in TRECVID 2017 MED and won second place, verifying the effectiveness of the multimedia event detection framework and the algorithms proposed in this paper.
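The CNN-plus-VLAD aggregation named in point 1 can be sketched as follows. This is a minimal illustration, not the system described in the paper: it assumes frame-level CNN descriptors are already extracted and a K-word codebook has been learned (e.g., by k-means), and the helper name `vlad_encode` is hypothetical.

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Aggregate per-frame descriptors (N x D) against a K x D codebook
    into a single K*D VLAD vector (illustrative sketch only)."""
    # Assign each frame descriptor to its nearest codeword
    dists = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=2
    )
    assignments = np.argmin(dists, axis=1)

    K, D = codebook.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members) > 0:
            # Accumulate residuals between descriptors and their codeword
            vlad[k] = (members - codebook[k]).sum(axis=0)

    vlad = vlad.flatten()
    # Signed square-root (power) normalization, standard for VLAD
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    # Final L2 normalization
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

A video is thus represented by one fixed-length vector regardless of its frame count, which can then be fed to a standard classifier for event detection.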