| Massive video is sharing out throught the Internet in each minute.The well-known video sharing site youtube upload video per minute up to 100 + hours.The total time of video uploaded to the wellknown video sharing site youtube can be more than 100 hours.It is necessary to classify theses uploaded video according to the content in response to multimedia information explosion for userfriendly.It,video analysis,could also be the tool to improve website traffic and analyse the need of business.In this paper deep learning and video and video analysis techniques are combined to develop a video analysis system based on deep learning.The system employs C3 D network and CNNs network of deep learning to extract background features and action characteristics.Then the features extracted will pass through the multilayer LSTM network.After a series of weighted operations,the background features and action characteristics based on possibility of weight will be translated into description to complete the caption work.In order to recognize the background and action features in video efficiently,this paper proposes an improved architecture based on CNNs model--C3 D model.Compared with the traditional CNNs model,the C3 D model refines the convolution operation and pooling operation in CNNs which the temporal features in video are added to the original spatial sequence correlation,that is,3d convolution operation and pooling operation.These changes enable to extract and preserve more featured and improve the accuracy of background recognition and action recognition.In order to caption the video with great accuracy according to the features extracted,a multi-layer LSTM model based on single layer LSTM model also is proposed in this paper.The features extracted from the top layer of C3 D tend to focus on the global visual perception field,while those extracted from the bottom layer are more focused on the fine-grained and local features.An effective and accurate caption should not only focus on the top-layer macroscopic features but also the low-layer detail features.In this paper,we propose a multilayer LSTM model to extract the low-layer features and top-layer features at the same time to more accurately describe the video.In the end,this paper presents the realization of the main functional modules of the video analysis system based on deep learning and the experimental data.Through analysis of these results,the system meets the actual demand,has a strong engineering value and practical value. |