| With the development of deep learning, artificial intelligence has brought great convenience to everyday life. As an important branch of video content analysis, video captioning promotes the further development of video retrieval and personalized video recommendation. Video captioning uses natural language to describe the visual content of a video, and the generated description must be accurate, readable, and fluent. In current research on video captioning algorithms based on the encoder-decoder model, high-level semantic information of the video is used as video semantic features, which effectively help the decoding model convert video visual features into captions more accurately. The quality of these semantic features therefore has an important impact on the accuracy of the captions generated by the decoding model. In the encoding stage, to address the low accuracy of the video semantic features extracted by existing video semantic detectors, this paper constructs a video semantic feature enhancement encoder model. The model enhances the encoded features through a highway layer structure and adds a video semantic word difference amplification module, which amplifies the differences between semantic words in the semantic features and improves the accuracy of the video semantic features. Experimental results show that the semantic features generated by the proposed algorithm are of higher quality and more effectively help the decoding model improve the accuracy of the generated captions. In the decoding stage, to further improve caption accuracy, this paper addresses two problems: during learning, the decoding model cannot give more attention to the words that are important to the video content, and the differences between word features are small. The word attention mechanism is combined with a word difference enhancement structure to build a word feature enhanced text decoder model, so that the word features both emphasize important words and differ more clearly from one another, which improves the performance of the decoding model. Comparative experiments on standard datasets show that the captions generated by the proposed algorithm match the video content more closely: they are not only accurate but also reflect details in the video. Compared with other algorithms in the same field, the proposed algorithm also scores significantly better on the caption evaluation metrics. |