With the rapid development of internet technology and electronic devices, smartphones, wearable electronics, and many other smart mobile terminals have become widely popularized, which has driven the continuous growth of egocentric video data. Egocentric action recognition has broad application prospects in human-machine interaction, risk warning, health monitoring, and other fields, and has attracted the attention of many researchers in the deep learning community. This thesis studies cross-modal fusion methods for egocentric action recognition, using the RGB video stream together with the acceleration and angular velocity signals of inertial sensors to recognize a series of common action categories in daily-life scenes. The specific research contents of this thesis are as follows:

1. Based on a 3D convolutional neural network architecture, this thesis studies a multi-branch feature extraction method for action recognition. For the two-branch information interaction within the SlowFast video network, an adaptive fusion method with dynamic downsampling of intermediate features is introduced: motion features extracted by an LSTM, either from the inertial motion data or from the video features themselves, generate a fidelity score for each video section, and these scores guide the downsampling in the fusion module (an illustrative sketch of this gating idea follows the summary below). Experiments on this thesis' dataset verify the effectiveness of the dynamic feature downsampling fusion mechanism in the SlowFast network.

2. This thesis studies a Transformer-based feature extraction network for video and inertial motion data. To reduce the high computational cost of vision Transformers on video tasks, the features are organized as a spatiotemporal array and the spatiotemporal self-attention is decoupled and computed step by step (a sketch of the decoupled attention is given below). Furthermore, temporally and spatially sparse self-attention and a spatial local mask are introduced to emulate the inductive biases of convolutional network design. The cross-modal fusion of the two-pathway features is finally output by a global average pooling and cross-correlation fusion module. Experimental results on this thesis' dataset show that the algorithm and model of this chapter effectively improve video action recognition.

3. This thesis studies a local-to-global temporal multi-modal Transformer as the feature extraction model. It splits the video frames, together with the acceleration and angular velocity data, into time-synchronized pairs of local short-term snippets; fine-grained temporal fusion is then modeled by a single-modal local temporal Transformer and a cross-modal global temporal Transformer in turn (a sketch of the fusion model and of the regularization term is also given below). To improve video classification learning under the multi-modal setting, a modality equilibrium regularization constraint is further proposed so that the two modalities fully cooperate with and complement each other. The final experimental results show that this multi-modal Transformer network improves action recognition accuracy to some extent.
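A minimal sketch of the dynamic downsampling fusion idea from contribution 1, in PyTorch. The tensor shapes, the `FidelityGate` name, and the way a per-step fidelity score gates an average-pooled fast-pathway feature are assumptions for illustration, not the thesis' exact design:

```python
# Hypothetical sketch: IMU-driven fidelity gating before temporal downsampling
# of the SlowFast fast pathway. Shapes and module names are assumptions.
import torch
import torch.nn as nn

class FidelityGate(nn.Module):
    """Scores each temporal section of the video from LSTM-extracted IMU
    motion features, weights the fast-pathway feature by those scores, then
    temporally downsamples it to the slow pathway's frame rate."""
    def __init__(self, imu_dim=6, hidden=64, alpha=4):
        super().__init__()
        self.alpha = alpha                       # temporal ratio fast/slow
        self.lstm = nn.LSTM(imu_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)        # one fidelity score per step

    def forward(self, fast_feat, imu):
        # fast_feat: (B, C, T, H, W); imu: (B, T, 6) accel + gyro
        h, _ = self.lstm(imu)                    # (B, T, hidden)
        s = torch.sigmoid(self.score(h))         # (B, T, 1) fidelity scores
        s = s.transpose(1, 2).unsqueeze(-1).unsqueeze(-1)  # (B, 1, T, 1, 1)
        weighted = fast_feat * s                 # emphasize reliable sections
        # adaptive temporal downsampling before lateral fusion:
        return nn.functional.avg_pool3d(
            weighted, kernel_size=(self.alpha, 1, 1), stride=(self.alpha, 1, 1))

gate = FidelityGate()
out = gate(torch.randn(2, 32, 16, 14, 14), torch.randn(2, 16, 6))  # (2,32,4,14,14)
```

Weighting before pooling lets high-fidelity sections dominate the downsampled feature that is laterally fused into the slow pathway.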
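The decoupled spatiotemporal self-attention of contribution 2 can be sketched as factorized attention: one pass over the patch tokens within each frame, then one pass over the frames at each spatial location. The token layout and dimensions below are assumptions, and the sparse attention and spatial local mask are omitted for brevity:

```python
# Hypothetical sketch: factorized (decoupled) spatiotemporal self-attention.
import torch
import torch.nn as nn

class DecoupledSTAttention(nn.Module):
    """Attends over space within each frame, then over time at each spatial
    location, reducing cost from O((T*N)^2) for joint attention to roughly
    O(T*N^2 + N*T^2)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, C) -- T frames, N patch tokens per frame
        B, T, N, C = x.shape
        # spatial step: attend across the N tokens of each frame
        xs = x.reshape(B * T, N, C)
        xs, _ = self.spatial(xs, xs, xs)
        x = xs.reshape(B, T, N, C)
        # temporal step: attend across the T frames at each location
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt, _ = self.temporal(xt, xt, xt)
        return xt.reshape(B, N, T, C).permute(0, 2, 1, 3)

attn = DecoupledSTAttention()
y = attn(torch.randn(2, 8, 49, 128))  # (2, 8, 49, 128)
```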
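Finally, a minimal sketch of the local-to-global temporal fusion and one possible form of the modality equilibrium term from contribution 3. The snippet layout, the shared classification head, and the absolute-gap loss are hypothetical choices for illustration, not the thesis' exact formulation:

```python
# Hypothetical sketch: local single-modal Transformers feed a cross-modal
# global temporal Transformer; an equilibrium term balances the modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalFusion(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2, num_classes=10):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)
        self.local_video = enc()    # single-modal local temporal Transformer
        self.local_imu = enc()
        self.global_fusion = enc()  # cross-modal global temporal Transformer
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video_snips, imu_snips):
        # both inputs: (B, S, L, dim) -- S time-synchronized snippets of
        # L tokens each, already embedded into a shared dimension
        B, S, L, D = video_snips.shape
        v = self.local_video(video_snips.reshape(B * S, L, D)).mean(1)
        m = self.local_imu(imu_snips.reshape(B * S, L, D)).mean(1)
        v, m = v.reshape(B, S, D), m.reshape(B, S, D)
        # concatenate snippet-level tokens of both modalities for global fusion
        g = self.global_fusion(torch.cat([v, m], dim=1)).mean(1)
        return self.head(g), self.head(v.mean(1)), self.head(m.mean(1))

def equilibrium_loss(logit_v, logit_m, target):
    # hypothetical modality-equilibrium term: penalize the gap between the
    # per-modality losses so neither branch dominates training
    lv = F.cross_entropy(logit_v, target)
    lm = F.cross_entropy(logit_m, target)
    return (lv - lm).abs()
```

Exposing per-modality logits alongside the fused prediction is what makes a balancing constraint of this kind expressible as an auxiliary loss.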