With the rapid development of internet technology and electronic devices, smartphones, wearable electronics, and many other smart mobile terminals have become widely popularized, which has driven the continuous growth of egocentric video data. Egocentric action recognition has broad application prospects in human-machine interaction, risk warning, health monitoring, and other fields, and has attracted the attention of many researchers in the deep learning community. This thesis studies cross-modal fusion methods for egocentric action recognition, using the RGB video stream together with the acceleration and angular velocity signals of inertial sensors to recognize a series of common action categories in daily-life scenes. The specific research contents of this thesis are as follows:

1. Based on a 3D convolutional neural network architecture, this thesis studies a multi-branch feature extraction method for action recognition. For the two-branch information interaction within the SlowFast video network, an adaptive fusion method with dynamic downsampling of intermediate features is introduced: motion features extracted by an LSTM, either from the inertial motion data or from the video features themselves, generate a fidelity score for each video section, and these scores guide the downsampling in the fusion module (an illustrative sketch of this gating idea follows the summary below). Experiments on this thesis' dataset verify the effectiveness of the dynamic feature downsampling fusion mechanism in the SlowFast network.

2. This thesis studies a Transformer-based feature extraction network for video and inertial motion data. To reduce the high computational cost of vision Transformers on video tasks, the features are organized as a spatiotemporal array and the spatiotemporal self-attention is decoupled and computed step by step (a sketch of the decoupled attention is given below). Furthermore, temporally and spatially sparse self-attention and a spatial local mask are introduced to emulate the inductive biases of convolutional network design. The cross-modal fusion of the two-pathway features is finally output by a global average pooling and cross-correlation fusion module. Experimental results on this thesis' dataset show that the algorithm and model of this chapter effectively improve video action recognition.

3. This thesis studies a local-to-global temporal multi-modal Transformer as the feature extraction model. It splits the video frames, together with the acceleration and angular velocity data, into time-synchronized pairs of local short-term snippets; fine-grained temporal fusion is then modeled by a single-modal local temporal Transformer and a cross-modal global temporal Transformer in turn (a sketch of the fusion model and of the regularization term is also given below). To improve video classification learning under the multi-modal setting, a modality equilibrium regularization constraint is further proposed so that the two modalities fully cooperate with and complement each other. The final experimental results show that this multi-modal Transformer network improves action recognition accuracy to some extent.
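A minimal sketch of the dynamic downsampling fusion idea from contribution 1, in PyTorch. The tensor shapes, the `FidelityGate` name, and the way a per-step fidelity score gates an average-pooled fast-pathway feature are assumptions for illustration, not the thesis' exact design:

```python
# Hypothetical sketch: IMU-driven fidelity gating before temporal downsampling
# of the SlowFast fast pathway. Shapes and module names are assumptions.
import torch
import torch.nn as nn

class FidelityGate(nn.Module):
    """Scores each temporal section of the video from LSTM-extracted IMU
    motion features, weights the fast-pathway feature by those scores, then
    temporally downsamples it to the slow pathway's frame rate."""
    def __init__(self, imu_dim=6, hidden=64, alpha=4):
        super().__init__()
        self.alpha = alpha                       # temporal ratio fast/slow
        self.lstm = nn.LSTM(imu_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)        # one fidelity score per step

    def forward(self, fast_feat, imu):
        # fast_feat: (B, C, T, H, W); imu: (B, T, 6) accel + gyro
        h, _ = self.lstm(imu)                    # (B, T, hidden)
        s = torch.sigmoid(self.score(h))         # (B, T, 1) fidelity scores
        s = s.transpose(1, 2).unsqueeze(-1).unsqueeze(-1)  # (B, 1, T, 1, 1)
        weighted = fast_feat * s                 # emphasize reliable sections
        # adaptive temporal downsampling before lateral fusion:
        return nn.functional.avg_pool3d(
            weighted, kernel_size=(self.alpha, 1, 1), stride=(self.alpha, 1, 1))

gate = FidelityGate()
out = gate(torch.randn(2, 32, 16, 14, 14), torch.randn(2, 16, 6))  # (2,32,4,14,14)
```

Weighting before pooling lets high-fidelity sections dominate the downsampled feature that is laterally fused into the slow pathway.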
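The decoupled spatiotemporal self-attention of contribution 2 can be sketched as factorized attention: one pass over the patch tokens within each frame, then one pass over the frames at each spatial location. The token layout and dimensions below are assumptions, and the sparse attention and spatial local mask are omitted for brevity:

```python
# Hypothetical sketch: factorized (decoupled) spatiotemporal self-attention.
import torch
import torch.nn as nn

class DecoupledSTAttention(nn.Module):
    """Attends over space within each frame, then over time at each spatial
    location, reducing cost from O((T*N)^2) for joint attention to roughly
    O(T*N^2 + N*T^2)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, C) -- T frames, N patch tokens per frame
        B, T, N, C = x.shape
        # spatial step: attend across the N tokens of each frame
        xs = x.reshape(B * T, N, C)
        xs, _ = self.spatial(xs, xs, xs)
        x = xs.reshape(B, T, N, C)
        # temporal step: attend across the T frames at each location
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt, _ = self.temporal(xt, xt, xt)
        return xt.reshape(B, N, T, C).permute(0, 2, 1, 3)

attn = DecoupledSTAttention()
y = attn(torch.randn(2, 8, 49, 128))  # (2, 8, 49, 128)
```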
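Finally, a minimal sketch of the local-to-global temporal fusion and one possible form of the modality equilibrium term from contribution 3. The snippet layout, the shared classification head, and the absolute-gap loss are hypothetical choices for illustration, not the thesis' exact formulation:

```python
# Hypothetical sketch: local single-modal Transformers feed a cross-modal
# global temporal Transformer; an equilibrium term balances the modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalFusion(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2, num_classes=10):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)
        self.local_video = enc()    # single-modal local temporal Transformer
        self.local_imu = enc()
        self.global_fusion = enc()  # cross-modal global temporal Transformer
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video_snips, imu_snips):
        # both inputs: (B, S, L, dim) -- S time-synchronized snippets of
        # L tokens each, already embedded into a shared dimension
        B, S, L, D = video_snips.shape
        v = self.local_video(video_snips.reshape(B * S, L, D)).mean(1)
        m = self.local_imu(imu_snips.reshape(B * S, L, D)).mean(1)
        v, m = v.reshape(B, S, D), m.reshape(B, S, D)
        # concatenate snippet-level tokens of both modalities for global fusion
        g = self.global_fusion(torch.cat([v, m], dim=1)).mean(1)
        return self.head(g), self.head(v.mean(1)), self.head(m.mean(1))

def equilibrium_loss(logit_v, logit_m, target):
    # hypothetical modality-equilibrium term: penalize the gap between the
    # per-modality losses so neither branch dominates training
    lv = F.cross_entropy(logit_v, target)
    lm = F.cross_entropy(logit_m, target)
    return (lv - lm).abs()
```

Exposing per-modality logits alongside the fused prediction is what makes a balancing constraint of this kind expressible as an auxiliary loss.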