Efficiently extracting the changing temporal and spatial information in a video, eliminating the influence of complex backgrounds, and retaining as much detail of the moving subject as possible are key to improving the accuracy of action recognition. This paper introduces a deep learning semantic segmentation model into the action recognition task and proposes a lightweight video action recognition algorithm based on image segmentation. The algorithm uses an improved U-Net++ semantic segmentation network to separate the foreground and background of an action video, retaining the moving human body and eliminating background interference; it then uses the proposed dense 3D convolutional neural network to extract the temporal and spatial features of the video and complete action recognition.

For image segmentation, this paper prunes, optimizes, and trains the U-Net++ medical image segmentation network based on its deep supervision mechanism, obtaining the optimized U-Net++ L3 semantic segmentation network for human body segmentation. A UCF101 image segmentation dataset is also constructed for network optimization and testing. Experiments show that the proposed U-Net++ L3 pruned network improves accuracy over the baseline network on the LIP human semantic segmentation dataset; compared with SOTA models, its parameter count is effectively reduced at only a slight cost in accuracy, its inference speed is greatly improved, and it achieves the best balance between accuracy and speed, better meeting the lightweight requirements of the model.

For video action recognition, this paper designs a dense 3D convolutional block as the basic unit and constructs a dense 3D convolutional neural network that efficiently extracts the spatiotemporal features of action videos and strengthens the transfer and reuse of features within the network. At the same time, in order to make the features
extracted by the network sufficiently discriminative, a joint loss function based on a Fisher discriminant regularization term is proposed to increase the inter-class dispersion of the extracted features, reduce their intra-class dispersion, and improve the accuracy of action recognition. Experimental results on the UCF101 action recognition dataset show that the dense 3D convolutional neural network outperforms the benchmark network in both parameter count and accuracy, verifying the effectiveness of the proposed method. Finally, based on the above algorithms, this paper designs and implements an image segmentation based video action recognition system that performs real-time human semantic segmentation and action recognition on an input video stream and records the recognition results, improving the system's ability to understand video and realizing an efficient application of the algorithms.
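The dense connectivity that drives feature transfer and reuse in the dense 3D convolutional block can be illustrated with a minimal, dependency-free NumPy sketch. This is not the paper's implementation: the real block uses learned 3x3x3 convolutions, which are replaced here by a random 1x1x1 channel-mixing stand-in, and the names `dense_3d_block`, `growth`, and `n_layers` are illustrative assumptions. What the sketch does show is the connectivity pattern: each layer receives the channel-wise concatenation of the block input and all previous layer outputs.

```python
import numpy as np

def conv3d_layer(x, out_ch, rng):
    """Stand-in for a 3D conv + ReLU: a random 1x1x1 channel mix so the
    sketch stays dependency-free. Shapes are (C, T, H, W):
    channels, time, height, width."""
    w = rng.standard_normal((out_ch, x.shape[0])) * 0.1
    return np.maximum(np.einsum('oc,cthw->othw', w, x), 0.0)

def dense_3d_block(x, growth=4, n_layers=3, rng=None):
    """Dense connectivity: each layer sees the channel-wise concatenation
    of the block input and all previous layer outputs, so earlier features
    are reused rather than recomputed. Output has C + n_layers*growth
    channels."""
    if rng is None:
        rng = np.random.default_rng(0)
    feats = [x]
    for _ in range(n_layers):
        inp = np.concatenate(feats, axis=0)   # concat along channel axis
        feats.append(conv3d_layer(inp, growth, rng))
    return np.concatenate(feats, axis=0)
```

Because the block input is part of the final concatenation, an 8-channel input with `growth=4` and `n_layers=3` yields a 20-channel output whose first 8 channels are the input itself, which is exactly the reuse property the text describes.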
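The Fisher-style joint loss can be sketched as follows. This is a common formulation of a Fisher discriminant regularizer (intra-class scatter minus inter-class scatter, added to cross-entropy); the paper's exact weighting and form may differ, and the names `fisher_regularizer`, `joint_loss`, and `lam` are illustrative assumptions.

```python
import numpy as np

def fisher_regularizer(features, labels):
    """Fisher-style discriminant term on a batch of feature vectors:
    intra-class scatter minus inter-class scatter. Minimizing it pulls
    same-class features toward their class mean and pushes class means
    away from the global mean."""
    global_mean = features.mean(axis=0)
    intra, inter = 0.0, 0.0
    for c in np.unique(labels):
        cls_feats = features[labels == c]
        mu_c = cls_feats.mean(axis=0)
        intra += np.sum((cls_feats - mu_c) ** 2)
        inter += len(cls_feats) * np.sum((mu_c - global_mean) ** 2)
    return intra - inter

def joint_loss(logits, features, labels, lam=0.01):
    """Softmax cross-entropy plus the weighted Fisher regularization
    term (lam balances the two objectives)."""
    z = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    return ce + lam * fisher_regularizer(features, labels)
```

For two tight, well-separated class clusters the regularizer is negative (small intra-class scatter, large inter-class scatter), so gradient descent on the joint loss is rewarded for producing exactly the dispersion structure the abstract describes.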