| In recent years, video has become an important information carrier with the continuous development of the Internet and intelligent devices. At the same time, more and more public places have installed video surveillance equipment for security reasons. With the explosive growth in the number of videos, enabling computers to recognize human actions in video has become a research hotspot. As an important research direction in computer vision and pattern recognition, action recognition has broad application prospects in intelligent monitoring, human-computer interaction, virtual reality, intelligent rehabilitation, and sports training. At present, many successful action recognition methods are built on the two-stream network architecture. This architecture consists of two networks, a spatial stream and a temporal stream, which extract the appearance information and the motion information of the video, respectively. The inputs processed by the two-stream network (RGB images and optical flow images) are easily affected by factors such as complex backgrounds and illumination changes; when these factors change, recognition performance drops considerably. Skeleton data provides abstract, high-level features of human action and is robust to variations in illumination, background, and appearance. However, skeleton data is highly condensed: human motion is represented only by the movement of body joints, and information such as color, shape, and texture in the video is lost, so some actions are difficult to identify from skeleton data alone. The RGB images and optical flow images processed by the two-stream network, by contrast, do contain color, shape, and texture information. Therefore, this paper extends the two-stream architecture and implements an action recognition method based on a three-stream network architecture; that is, a skeleton stream branch is added to the two-stream network architecture. Specifically, following the idea of the Temporal Segment Network (TSN), this paper uses ResNet101 to build the spatial stream and temporal stream networks, and uses a Convolutional Neural Network (CNN) model to build the skeleton stream network. The multi-stream network not only overcomes the shortcomings of action recognition based solely on skeleton data, but also integrates three modalities (visual cues): video frame images, optical flow, and the human skeleton. These modalities complement one another and provide richer information for human action recognition. Finally, this paper tries two fusion methods for the multi-stream network architecture: direct averaging of the results and weighted averaging of the results. To demonstrate the effectiveness and feasibility of the method, extensive experiments were carried out on the large-scale NTU RGB+D dataset. The experimental results show that the proposed multi-stream action recognition method achieves high recognition performance, addresses the low recognition rate and instability of single-stream networks, and has broad application prospects and market potential. Of the two fusion methods, the weighted average of the results proves more effective. |
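
The two fusion methods named in the abstract, direct averaging and weighted averaging of the per-stream results, can be illustrated with a minimal sketch. The abstract gives neither the stream weights nor the score format, so everything concrete below is an assumption made for illustration: the per-class scores, the weights `[1, 1, 2]`, and the helper names `average_fusion`, `weighted_fusion`, and `predict` are all hypothetical.

```python
from typing import List, Sequence


def average_fusion(scores: Sequence[Sequence[float]]) -> List[float]:
    """Direct result averaging: every stream contributes equally."""
    n_streams = len(scores)
    n_classes = len(scores[0])
    return [sum(s[c] for s in scores) / n_streams for c in range(n_classes)]


def weighted_fusion(scores: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Weighted result averaging: each stream is scaled by its own weight."""
    total = sum(weights)
    n_classes = len(scores[0])
    return [
        sum(w * s[c] for s, w in zip(scores, weights)) / total
        for c in range(n_classes)
    ]


def predict(fused: Sequence[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda c: fused[c])


# Hypothetical per-class scores for one video (3 action classes).
spatial = [0.7, 0.2, 0.1]    # RGB frame stream
temporal = [0.5, 0.4, 0.1]   # optical flow stream
skeleton = [0.2, 0.7, 0.1]   # skeleton stream
streams = [spatial, temporal, skeleton]

# Hypothetical weights; the abstract does not state the values used.
weights = [1.0, 1.0, 2.0]

avg_class = predict(average_fusion(streams))
weighted_class = predict(weighted_fusion(streams, weights))
print(avg_class, weighted_class)
```

With these made-up numbers the two fusion rules pick different classes, which shows how the choice of weights can change the final decision. In practice the weights would presumably be tuned on a validation set.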