| Objective: Two problems arise in human action recognition. First, cameras at different positions capture different spatial states of the target, so a recognition model trained on a single perspective is not robust enough. Second, the motion amplitude and contribution of each body region differ between action classes, so the local motion state of the human body is also a key factor in recognition. This paper uses skeleton data obtained by the Kinect depth camera for human action recognition, and studies experimentally both action recognition on block color-coded skeleton data under different viewing angles and multi-angle feature fusion. Methods: First, the skeletal data are preprocessed with 3V-MHIs, a template-matching-based algorithm for capturing human motion trajectories in video sequences. The spatial coordinates of the 20 skeletal joint points captured by the depth camera are used to reconstruct the skeleton frame in 3D and color it block by block, and three Skl-Color sequences are output, one for each of three viewing angles. After channel separation, the inter-frame difference method is applied to each Skl-Color sequence to obtain single-channel motion history images for R, G, and B, and the RGB-MHIs are finally obtained by channel merging. A dual-stream heterogeneous network model is proposed, in which a heterogeneous model composed of ResNet and ConvLSTM learns the spatial and temporal features of the skeleton data, respectively. Experiments are carried out on ConvLSTM models with different numbers of layers and hidden sizes. A feature fusion module is added to compare four methods from three fusion strategies on the features obtained by the ResNet models: averaging, maximization, and two channel-stacking methods. Because spatial and temporal features differ, the outputs of the heterogeneous models are fused in series as the final input data of the model. Results: Experiments are conducted on the public human action recognition dataset UTD-MHAD. Among the single-view ResNet models, the front-view RGB-MHIf achieves the highest recognition accuracy (92.77%), while the top-view input model is relatively low (90.47%), because the top view based on 3D space mapping will produce a certain amount of [...]. Three isomorphic ResNet-50 models were then used for feature fusion, and the four fusion methods obtained recognition accuracies of 96.18%, 97.18%, 96.65%, and 96.65%, respectively. The experiments found that a four-layer ConvLSTM model with a hidden size of 64 achieves a relatively good recognition accuracy (91.01%). Finally, the dual-stream heterogeneous model with spatiotemporal feature fusion achieves a final recognition accuracy of 98.58% on the UTD-MHAD dataset. Conclusion: This paper proposes the 3V-MHIs algorithm, based on the motion history image and the depth motion map, to address the problem that the target state differs across viewing angles; the algorithm preserves the local motion features of the human body well. The dual-stream heterogeneous model combines ResNet with ConvLSTM to model the compact spatial features of images and the sparse temporal features across frames of human actions, and achieves a high recognition accuracy on the UTD-MHAD dataset. The study provides a reference for human action recognition using skeletal data or heterogeneous network models. |
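
The channel-wise motion history step described in the Methods can be illustrated with a minimal NumPy sketch. It assumes the textbook motion history image update (a pixel is set to a maximum value when the inter-frame difference exceeds a threshold and decays linearly otherwise); the function names `channel_mhi` and `rgb_mhi`, the threshold of 30, and the decay schedule are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def channel_mhi(frames, threshold=30, tau=255.0, decay=None):
    """Motion history image for one channel of a sequence shaped (T, H, W).

    threshold, tau, and decay are assumed values chosen for illustration.
    """
    frames = np.asarray(frames, dtype=np.float32)  # avoid uint8 wraparound
    num_frames = frames.shape[0]
    if decay is None:
        # Linear decay so the oldest motion fades to zero over the sequence.
        decay = tau / max(num_frames - 1, 1)
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, num_frames):
        # Inter-frame difference: pixels that changed between t-1 and t.
        moving = np.abs(frames[t] - frames[t - 1]) > threshold
        # Moving pixels are reset to tau; all others fade by one decay step.
        mhi = np.where(moving, tau, np.maximum(mhi - decay, 0.0))
    return mhi.astype(np.uint8)


def rgb_mhi(sequence, **kwargs):
    """Split a (T, H, W, 3) sequence into R, G, B, build one MHI per channel,
    and merge the three single-channel MHIs back into one (H, W, 3) image."""
    sequence = np.asarray(sequence)
    channels = [channel_mhi(sequence[..., c], **kwargs) for c in range(3)]
    return np.stack(channels, axis=-1)
```

Calling `rgb_mhi` once per viewing angle on the corresponding color-coded skeleton sequence would yield one RGB image per view, with recent motion brighter than older motion in each color channel.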
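
The three fusion strategies compared for the per-view features (averaging, maximization, channel stacking) can be sketched as follows. This is an illustration under stated assumptions: the function name `fuse_views` and the `(C, H, W)` feature layout are hypothetical, and since the abstract does not say how the two channel-stacking variants differ, only plain concatenation along the channel axis is shown.

```python
import numpy as np


def fuse_views(features, strategy="average"):
    """Fuse per-view feature maps, each shaped (C, H, W).

    The strategy names are illustrative labels, not the paper's terminology.
    """
    views = np.stack(features, axis=0)  # (V, C, H, W)
    if strategy == "average":
        return views.mean(axis=0)       # element-wise mean over views
    if strategy == "max":
        return views.max(axis=0)        # element-wise maximum over views
    if strategy == "stack":
        # Concatenate along the channel axis: (V * C, H, W).
        return np.concatenate(features, axis=0)
    raise ValueError(f"unknown fusion strategy: {strategy}")
```

Averaging and maximization keep the feature shape unchanged, whereas stacking multiplies the channel count by the number of views, so any layer consuming the fused features would need a matching input size.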