With the development of Internet technology and the communication capability of mobile devices, the amount of multimedia data has been increasing explosively. Video data, richer in information and more engaging, has become a major part of it. This has motivated many video understanding tasks, such as action detection and skeleton recognition. Among them, video captioning aims to generate high-quality natural language sentences that describe the video content, which aligns with how humans communicate. Combined with large-scale language models, which have developed rapidly in recent years, high-performance video captioning models can support various functions for better human-computer interaction, such as video content retrieval, automatic video monitoring, and video summarization, further improving the utility of video data.

The enormous volume of video data also poses challenges for the video captioning task. Three of the main challenges are as follows: (1) Higher video frame rates imply denser sampling, which produces many nearly identical images in the frame sequence and makes the video data redundant. Related studies confirm that this redundant information does not benefit the captioning task but significantly increases the computational cost. (2) Most existing methods sample a fixed number of frames from each video, so the sampling interval between frames is inconsistent and the time-scale information cannot be represented correctly. (3) Real-world videos are much longer and contain multiple events. Most existing methods control the data size by limiting the total number of sampled frames, so the features of long videos are over-compressed and captioning quality degrades.

To address these challenges and generate high-quality captions for videos of different lengths, this thesis studies video captioning algorithms based on spatiotemporal attention. The main work is as follows:

(1) We propose a frame-reduced video captioning model based on spatiotemporal attention. To handle the redundant information prevalent in video data, we design a feature extraction algorithm based on frame-reduced spatiotemporal attention for the video encoding stage. Exploiting the normalization property of the attention matrix in spatiotemporal attention, we design a frame redundancy evaluation method that gradually removes the redundant parts of the video feature sequence, so that the network can focus on the task-relevant parts of the video; this improves the generalization of the model and reduces the computational cost. To address the inconsistent sampling interval and the loss of time-scale information caused by removing redundant features, we propose a time-indexed position encoding method, which preserves the original time-scale information, improves video feature extraction, and increases the accuracy of the generated captions.
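The following Python sketch illustrates one plausible form of this mechanism: the column sums of a row-normalized temporal attention matrix serve as a redundancy proxy, the least-attended frames are pruned, and the surviving frames are encoded with a sinusoidal position encoding evaluated at their original time indices. The scoring rule, the `keep_ratio` parameter, and the PyTorch interface are illustrative assumptions, not the thesis's exact formulation.

```python
import math
import torch

def redundancy_scores(attn: torch.Tensor) -> torch.Tensor:
    """Aggregate attention each frame receives (column sum of a
    row-softmax-normalized T x T temporal attention matrix).
    Low scores mark frames the model rarely attends to, i.e.
    pruning candidates. This proxy is an assumption; the thesis's
    exact redundancy measure may differ."""
    return attn.sum(dim=0)  # shape (T,)

def prune_frames(feats: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.5):
    """Drop the least-attended frames, returning the surviving
    features together with their ORIGINAL time indices."""
    t = feats.size(0)
    k = max(1, int(t * keep_ratio))
    scores = redundancy_scores(attn)
    keep = scores.topk(k).indices.sort().values  # preserve temporal order
    return feats[keep], keep

def time_indexed_pe(indices: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal positional encoding evaluated at the kept frames'
    original time indices, so the encoding still reflects the true
    (now non-uniform) sampling times."""
    pos = indices.float().unsqueeze(1)  # (k, 1)
    freq = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(indices.size(0), dim)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

# toy usage with random stand-ins for real frame features and attention
T, D = 32, 512
feats = torch.randn(T, D)
attn = torch.softmax(torch.randn(T, T), dim=-1)
kept, idx = prune_frames(feats, attn, keep_ratio=0.5)
kept = kept + time_indexed_pe(idx, D)  # encode at original time indices
```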
(2) When longer or variable-length videos are used as input, large-interval sampling loses features while dense sampling incurs a high computational cost. To address this, we propose a long-video continuous captioning algorithm based on spatiotemporal attention feature compression. The model reads the video in streaming mode and extracts features layer by layer, linking spatiotemporal features across multiple rounds of input through a small-scale low-level feature memory. We design a spatiotemporal attention-based feature compression method that keeps the scale of the stored video features within a limited range. Finally, long-video continuous captioning is realized by the event detection and sentence generation modules.
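A minimal sketch of such a fixed-budget feature memory is given below, assuming that the aggregate self-attention each stored vector receives is used as its importance score when compressing; the scoring rule, the `budget` parameter, and the class interface are illustrative assumptions rather than the thesis's actual design.

```python
import torch

class AttentionFeatureMemory:
    """Fixed-budget feature memory for streaming video input.
    After each chunk, old memory and new features are merged and
    compressed back to at most `budget` vectors, ranked by the
    aggregate attention they receive from the merged set."""

    def __init__(self, budget: int, dim: int):
        self.budget = budget
        self.mem = torch.empty(0, dim)

    def update(self, chunk_feats: torch.Tensor) -> torch.Tensor:
        merged = torch.cat([self.mem, chunk_feats], dim=0)  # (N, D)
        if merged.size(0) <= self.budget:
            self.mem = merged
            return self.mem
        # Scaled dot-product self-attention over the merged features;
        # column sums measure how much each vector is referenced by
        # the others (an assumed importance proxy).
        scores = torch.softmax(
            merged @ merged.t() / merged.size(1) ** 0.5, dim=-1
        ).sum(dim=0)
        keep = scores.topk(self.budget).indices.sort().values  # keep temporal order
        self.mem = merged[keep]
        return self.mem

# streaming usage over successive chunks of a long video
memory = AttentionFeatureMemory(budget=64, dim=512)
for _ in range(10):                    # 10 incoming chunks
    chunk = torch.randn(16, 512)       # 16 new frame features per round
    compressed = memory.update(chunk)  # always at most 64 vectors
```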