| As technology advances, people increasingly seek more intelligent ways of living. Human action recognition, an indispensable research direction for artificial intelligence, has attracted extensive attention from both the research community and industry. Traditional action recognition methods rely mainly on hand-crafted features, which have various limitations and can no longer meet current needs. In recent years, deep learning has achieved great success in computer vision, opening a new path for action recognition research. A large number of deep-learning-based human action recognition methods have been proposed, promoting the application and development of the field. This paper focuses on human action recognition based on the two-stream network and improves existing two-stream methods in terms of recognition accuracy and processing speed, respectively. The main work is as follows. First, a spatiotemporal heterogeneous two-stream network based on long-range temporal structure modeling is proposed. Humans recognize and understand appearance and motion through two completely different processes, yet most existing two-stream models adopt the same structure for the spatial and temporal networks. This paper therefore proposes a spatiotemporal heterogeneous two-stream network, which uses two different network structures to process spatial and temporal information. To maximize its performance, ResNet and BN-Inception are used as the base networks to extract more discriminative spatiotemporal features. In addition, a segmental architecture is employed to model long-range temporal structure over video sequences, so as to better distinguish similar actions that share sub-actions. Moreover, combined with a data augmentation strategy, a modified cross-modal pre-training 
strategy is proposed to further improve recognition accuracy. Experiments on the UCF101 and HMDB51 datasets demonstrate that the proposed spatiotemporal heterogeneous two-stream network outperforms spatiotemporal isomorphic two-stream networks and other related methods. Second, to address the high computational cost and poor real-time performance of optical flow in current two-stream methods, a real-time action recognition method based on enhanced motion vectors is proposed. By replacing optical flow with motion vectors, a Spatiotemporal Heterogeneous Two-stream Network Based on Motion Vector (MV-STH) is constructed, which reduces computational complexity and enables real-time processing of video sequences. Motion vectors are widely used in video compression standards and can be obtained directly during decoding without additional computation. However, motion vectors lack fine structure, which leads to an evident degradation in recognition performance. A knowledge transfer strategy is therefore introduced to initialize the MV-STH network with a pre-trained model learned from optical flow; the resulting network is called the Spatiotemporal Heterogeneous Two-stream Network Based on Enhanced Motion Vector (EMV-STH). This method achieves recognition performance comparable to some state-of-the-art approaches on UCF101 and HMDB51. More importantly, its processing speed is about 13 times that of the spatiotemporal heterogeneous two-stream network based on optical flow. |
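
The segmental architecture and two-stream fusion summarized in the abstract can be illustrated with a minimal sketch. It assumes a common segment-based scheme: the video is split into equal segments, one snippet is sampled from each, per-snippet class scores are averaged into a video-level prediction, and the spatial and temporal streams are combined by a weighted average. The function names, the averaging consensus, and the `temporal_weight` value are all hypothetical choices for illustration; the abstract does not specify the consensus or fusion used by the authors.

```python
import random
from typing import List, Sequence


def sample_segment_indices(num_frames: int, num_segments: int,
                           rng: random.Random) -> List[int]:
    """Split [0, num_frames) into equal segments and pick one frame index per segment."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    # Integer segment boundaries; each segment is non-empty because
    # num_frames >= num_segments.
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def segmental_consensus(snippet_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-snippet class scores into one video-level score vector.

    Averaging is an assumed consensus function, not confirmed by the source.
    """
    n = len(snippet_scores)
    return [sum(column) / n for column in zip(*snippet_scores)]


def fuse_streams(spatial: Sequence[float], temporal: Sequence[float],
                 temporal_weight: float = 1.5) -> List[float]:
    """Weighted average of spatial and temporal class scores.

    temporal_weight is a hypothetical default chosen for illustration only.
    """
    total = 1.0 + temporal_weight
    return [(s + temporal_weight * t) / total for s, t in zip(spatial, temporal)]


if __name__ == "__main__":
    rng = random.Random(0)
    # One frame index from each of three segments of a 30-frame video.
    print(sample_segment_indices(30, 3, rng))
    # Placeholder scores standing in for the outputs of the two networks.
    spatial = segmental_consensus([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])
    temporal = segmental_consensus([[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]])
    print(fuse_streams(spatial, temporal))
```

Because only one snippet per segment is processed, the cost of a forward pass stays fixed regardless of video length while the sampled snippets still span the whole sequence, which is the motivation for long-range temporal modeling stated above.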