With the emergence of short-video platforms, video has become a primary medium through which people communicate, record, and share their lives, and video content description has gradually become a new research hotspot. Video content description is applied in many fields of daily life and industry, such as autonomous driving, intelligent security, and guidance for the blind, so it has high research significance and practical value. Current video content description methods mainly adopt deep learning based on the sequence-to-sequence (Seq2Seq) framework and have achieved good results. However, these models still have shortcomings in feature extraction, encoding, and decoding:

1) existing models do not make full use of the encoder's hidden-layer features during decoding;
2) they fail to effectively utilize the manually annotated text descriptions in the dataset;
3) when generating descriptions, they pay too little attention to the temporal information of the video and attend to the video's global and local content at only a single scale.

The main research content of this article is as follows.

1) A video content description method based on ViT (Vision Transformer) and reinforcement learning is proposed. First, a ViT encoding module encodes the video features, and the complete encoded features are passed to the decoding module, which reduces the loss of hidden-layer features during encoding and improves the model's utilization of those features. Then, a reinforcement learning algorithm further optimizes the model parameters: as the model predicts each descriptive word, the parameters are updated according to the reward values fed back from the environment, improving model accuracy (a rough sketch of this update is given below). Experiments on the public dataset MSR-VTT show that the proposed method performs well in generating video content descriptions; compared with the baseline models, all four evaluation metrics improve, verifying the effectiveness of the model.

2) Drawing on the manually annotated description text in a video description dataset, a method that uses video semantic features to improve the quality of video content description is proposed. First, a semantic feature extraction network is constructed and trained to extract semantic information from videos, and the visual and semantic text features are fused to improve the text descriptions the model generates (see the second sketch below). Then, the decoder is redesigned so that the semantic features are introduced into the decoding module, improving the readability of the generated descriptions. Comparative and ablation experiments on the public MSR-VTT dataset show significant improvements on all metrics, verifying that introducing semantic information can improve the quality of video content description.
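To make the reward-driven update of the first contribution concrete, below is a minimal PyTorch sketch in the style of self-critical sequence training (SCST), one common way to realize "optimizing parameters from environment-feedback rewards" in caption generation. The toy decoder, the `sentence_reward` stand-in, and all dimensions are illustrative assumptions, not the thesis's actual components; a real reward would score generated captions against references with a metric such as CIDEr.

```python
# SCST-style policy-gradient update: reward(sampled) - reward(greedy) as advantage.
import torch
import torch.nn as nn

class ToyCaptionDecoder(nn.Module):
    """Toy decoder: maps a video feature to a sequence of word logits."""
    def __init__(self, feat_dim=512, vocab_size=1000, hidden=512, max_len=12):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)
        self.max_len = max_len

    def sample(self, feat, greedy=False):
        """Roll out a caption; return token ids and per-token log-probabilities."""
        h = torch.tanh(self.init_h(feat))                   # (B, hidden)
        tok = torch.zeros(feat.size(0), dtype=torch.long)   # <bos> = 0
        ids, logps = [], []
        for _ in range(self.max_len):
            h = self.gru(self.embed(tok), h)
            logp = torch.log_softmax(self.out(h), dim=-1)
            if greedy:
                tok = logp.argmax(-1)
            else:
                tok = torch.distributions.Categorical(logits=logp).sample()
            ids.append(tok)
            logps.append(logp.gather(1, tok.unsqueeze(1)).squeeze(1))
        return torch.stack(ids, 1), torch.stack(logps, 1)

def sentence_reward(ids):
    """Placeholder reward; a real system scores against reference captions."""
    return (ids != 0).float().mean(dim=1)

decoder = ToyCaptionDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
feats = torch.randn(4, 512)  # pretend these are ViT-encoded video features

sample_ids, logps = decoder.sample(feats, greedy=False)    # exploratory rollout
with torch.no_grad():
    greedy_ids, _ = decoder.sample(feats, greedy=True)     # baseline rollout
advantage = sentence_reward(sample_ids) - sentence_reward(greedy_ids)
loss = -(advantage.unsqueeze(1) * logps).mean()            # policy-gradient loss
opt.zero_grad()
loss.backward()
opt.step()
```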
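Likewise, for the second contribution, the following sketch shows one plausible form of visual-semantic fusion: a small network predicts a multi-label vector of semantic concepts (trainable against tags mined from the annotated captions), which is concatenated with the visual feature and projected before decoding. `SemanticNet`, `FusedInput`, and the concept count are hypothetical names and sizes, not the paper's design.

```python
import torch
import torch.nn as nn

class SemanticNet(nn.Module):
    """Maps a pooled visual feature to multi-label concept probabilities."""
    def __init__(self, feat_dim=512, num_concepts=300):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                 nn.Linear(512, num_concepts))
    def forward(self, v):
        return torch.sigmoid(self.mlp(v))  # (B, K) concept probabilities

class FusedInput(nn.Module):
    """Concatenate visual and semantic features, project to the decoder width."""
    def __init__(self, feat_dim=512, num_concepts=300, dec_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim + num_concepts, dec_dim)
    def forward(self, v, s):
        return torch.tanh(self.proj(torch.cat([v, s], dim=-1)))

visual = torch.randn(4, 512)             # pooled visual features for 4 clips
semantic = SemanticNet()(visual)         # (4, 300) semantic concept scores
fused = FusedInput()(visual, semantic)   # (4, 512) input handed to the decoder

# Training the semantic branch: BCE against multi-hot tags mined from captions.
labels = torch.randint(0, 2, (4, 300)).float()  # dummy concept labels
bce = nn.BCELoss()(semantic, labels)
```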
3) Combining the global and local features of a video, a method that uses multiscale features to improve the quality of video content description is proposed. First, the performance of the model's upstream task is improved by selecting a feature extraction network pre-trained on a video dataset to extract video features. Then, global and local encoders are used to construct multiscale features, mining more comprehensive video information so that the model makes full use of it. Finally, a gating unit is introduced in the decoding stage to strengthen attention to the video's temporal information and to improve the accuracy and readability of the generated descriptions (a sketch of this gated step closes this section). Experiments on two public datasets, MSR-VTT and MSVD, verify that the model achieves higher performance when generating descriptions and that the generated descriptions are more consistent with human expression habits.
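Finally, a speculative sketch of the gating unit from the third contribution: attention is computed separately over the global and local encoder outputs, and a learned sigmoid gate mixes the two contexts before word prediction. The module and tensor names, and this particular gate design, are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GatedMultiscaleStep(nn.Module):
    """One decoding step: gate between global and local attended contexts."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.att_g = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.att_l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(3 * dim, dim)

    def forward(self, h, global_feats, local_feats):
        # h: (B, 1, dim) current decoder state, used as the attention query.
        cg, _ = self.att_g(h, global_feats, global_feats)  # global temporal context
        cl, _ = self.att_l(h, local_feats, local_feats)    # local fine-grained context
        g = torch.sigmoid(self.gate(torch.cat([h, cg, cl], dim=-1)))  # (B, 1, dim)
        return g * cg + (1 - g) * cl   # gated multiscale context for word prediction

step = GatedMultiscaleStep()
h = torch.randn(2, 1, 512)               # decoder hidden state at one time step
global_feats = torch.randn(2, 20, 512)   # e.g. clip-level features over 20 segments
local_feats = torch.randn(2, 80, 512)    # e.g. frame/region-level features
context = step(h, global_feats, local_feats)   # (2, 1, 512)
```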