With the continuous progress of science and technology and the development of deep learning, video description has become a research topic that keeps pace with the times and attracts many researchers at home and abroad; after years of work, a series of results have been achieved. Video description is a cross-modal task involving images, text, and audio: given a video segment, the goal is to generate a description based on the video's content, and the datasets used consist of short videos paired with annotation text. Dense video description is a more complex task than video description, requiring the analysis and processing of a longer video that contains multiple events, each of which must be extracted and described separately. The main work of this article is to study these two tasks: video description and dense video description.

First, a video description method that fuses visual and speech features is proposed. To address the lack of audio guidance in existing video description tasks, Mel-frequency cepstral features are extracted from the audio in the video and added to network training, so that feature information from different modalities guides the model to generate more effective text. In addition, in the visual feature extraction stage, this article uses a relatively advanced model framework, the Vision Transformer, to obtain more accurate visual feature information and provide more effective input for subsequent text generation. Extensive comparative experiments on the MSVD and MSR-VTT datasets show that the ROUGE-L metric improves by 0.2 and 1.2, respectively.

Second, a dense event description method based on visual-semantic embedding is proposed. In existing research, each event is localized only according to visual cues, while the related semantic and contextual information is ignored. The proposed method consists of a visual-semantic embedding module, a hierarchical description transformation module, and an event proposal network module. It captures the contextual associations of words, obtains more effective word information through n-grams, and achieves a joint embedding of visual and textual information. Temporal information is then embedded through the hierarchical description transformation, video events are segmented and extracted by the proposal network module, and the result is fed into a multi-head self-attention network for text generation. Experiments on the ActivityNet dataset show that ROUGE-L improves by 1.9, exceeding the current average level.
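As a concrete illustration of the fused visual and speech features described above, the following is a minimal sketch, assuming the MFCC audio features are extracted with librosa and the frame-level visual features with a pretrained Vision Transformer from timm; the function names, feature dimensions, and mean-pooling fusion are illustrative assumptions, not the exact pipeline used in this work.

```python
# Minimal sketch: ViT frame features + MFCC audio features for a captioning encoder.
# Assumed libraries: torch, timm, librosa. Names and dimensions are illustrative.
import torch
import timm
import librosa


def extract_visual_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, 224, 224), ImageNet-normalized. Returns (num_frames, 768)."""
    vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
    vit.eval()
    with torch.no_grad():
        return vit(frames)  # pooled per-frame features from the Vision Transformer


def extract_audio_features(wav_path: str, n_mfcc: int = 40) -> torch.Tensor:
    """Returns (num_windows, n_mfcc) MFCC features for the video's audio track."""
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return torch.from_numpy(mfcc).T.float()


def fuse(visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
    """Mean-pool each modality over time and concatenate into one clip-level vector."""
    return torch.cat([visual.mean(dim=0), audio.mean(dim=0)], dim=-1)  # (768 + n_mfcc,)
```

In a full captioning model, the fused clip vector (or the per-frame and per-window sequences before pooling) would be fed to the text decoder; the mean-pooling here is only the simplest possible fusion strategy.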