
Research On Video Captioning Based On Deep Learning

Posted on: 2024-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Yang
Full Text: PDF
GTID: 2568307082483084
Subject: Signal and Information Processing
Abstract/Summary:
Video captioning is an important problem to be solved in the development of artificial intelligence. It is a critical indicator of how far the field has advanced, and it has a diverse range of applications in human-robot interaction, security systems, and other areas. With the rapid growth of video as an information carrier, how to process such data quickly has attracted wide attention from researchers.

Existing research follows two main directions. The first is models based on traditional methods, which first detect objects and then fill the detected objects into designed templates. The second is models applying deep learning methods, which first extract visual features from the videos and then decode those features to generate the corresponding captions. Early video captioning models based on traditional methods do not scale: their complexity and computational requirements grow sharply in open-domain scenes and on large datasets. The rapid progress of deep learning methods has greatly improved the performance of video captioning models, but some aspects still call for improvement.

The purpose of this thesis is to explore solutions and improvements for video captioning based on deep learning methods by analyzing the shortcomings of existing models, and to propose models that generate higher-quality captions for practical applications. The specific work is summarized as follows:

1) A video captioning algorithm based on multimodal feature fusion and an attention mechanism is proposed. Existing encoder-decoder models based on deep learning lack further processing of the features extracted directly from convolutional neural networks, so redundant information interferes seriously with the model and the features cannot fully represent the key content of the videos. Meanwhile, text information is used insufficiently, so the model lacks important supervision when converting visual information into language. In addition, the caption-generation process of existing models is not transparent enough. To solve these problems, the proposed algorithm exploits the complementary information between different features to generate differentiated features representing different parts of speech in the captions, using the attention mechanism to remove the redundancy of the individual features. To strengthen the use of text information, the algorithm uses the part-of-speech tags of the ground-truth captions to supervise the feature-selection process, so that the generation of captions can be observed more clearly. The proposed algorithm achieves competitive results on two commonly used video captioning datasets; a minimal sketch of this kind of fusion module is given after this summary.

2) A video captioning algorithm with guidance signals based on an attention mechanism is proposed. This algorithm improves on the previous one. In the previous algorithm, the separated design of the sub-modules weakens the connection between different words. Moreover, that algorithm does not make sufficient use of the decoded text information and cannot adjust dynamically according to the previous output when generating differentiated features. To alleviate these deficiencies, the new algorithm generates dynamic guidance signals from rich visual information, text information, and historical information. With the introduction of guidance signals, the algorithm can dynamically adjust the generated features that represent words of different parts of speech. The correlation between the generated features and the decoded words becomes much stronger, which enhances the semantic consistency of the final results. The feature-selection and decoding processes of the model also benefit from the guidance signals. Experiments on two datasets show that the model achieves good results and that the addition of guidance signals improves performance considerably; a sketch of such a guidance module also follows.
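As a concrete illustration of the first algorithm's idea, the following is a minimal PyTorch sketch of attention-based multimodal feature fusion with part-of-speech supervision. All module names, dimensions, and the specific attention form are illustrative assumptions; the thesis's actual architecture is not reproduced here.

```python
# Minimal sketch of attention-based multimodal feature fusion with
# part-of-speech (POS) supervision. All names and dimensions are
# assumptions, not the thesis's actual implementation.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Fuses appearance and motion features with additive attention and
    predicts the part of speech of the next word as auxiliary supervision."""

    def __init__(self, app_dim=2048, motion_dim=1024, hidden_dim=512, num_pos=4):
        super().__init__()
        self.app_proj = nn.Linear(app_dim, hidden_dim)
        self.motion_proj = nn.Linear(motion_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim * 2, 1)        # additive attention score
        self.pos_head = nn.Linear(hidden_dim, num_pos)  # POS supervision head

    def forward(self, app_feats, motion_feats, decoder_state):
        # app_feats: (B, T, app_dim); motion_feats: (B, T, motion_dim)
        # decoder_state: (B, hidden_dim), e.g. the LSTM decoder's hidden state
        feats = torch.cat([self.app_proj(app_feats),
                           self.motion_proj(motion_feats)], dim=1)   # (B, 2T, H)
        query = decoder_state.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([feats, query], dim=-1))        # (B, 2T, 1)
        weights = torch.softmax(scores, dim=1)    # down-weights redundant frames
        fused = (weights * feats).sum(dim=1)      # (B, H) differentiated feature
        # During training, cross-entropy between pos_logits and the POS tag of
        # the ground-truth next word supervises the feature-selection process.
        pos_logits = self.pos_head(fused)
        return fused, pos_logits
```

A single attention pass here serves both to fuse the two modalities and to suppress redundancy, while the POS head gives the model the explicit text-side supervision from the ground-truth captions described in the summary.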
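For the second algorithm, the sketch below illustrates one plausible way to form a dynamic guidance signal from visual, textual, and historical information. The gating design and all names are assumptions rather than the thesis's implementation.

```python
# Minimal sketch of a dynamic guidance signal built from visual, textual,
# and historical information. The gating form is an assumption.
import torch
import torch.nn as nn

class GuidanceSignal(nn.Module):
    """Computes a gate from the fused visual feature, the previous word
    embedding, and the decoder history, then modulates the visual feature."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim * 3, dim), nn.Sigmoid())

    def forward(self, visual_feat, prev_word_emb, history_state):
        # All inputs: (B, dim). history_state summarizes previously decoded
        # words, e.g. the hidden state of an LSTM language decoder.
        g = self.gate(torch.cat([visual_feat, prev_word_emb, history_state],
                                dim=-1))                    # (B, dim) in (0, 1)
        # Element-wise gating strengthens or suppresses parts of the visual
        # feature according to what has already been generated, tightening
        # the link between the generated features and the decoded words.
        return g * visual_feat
```

Because the gate depends on the previous word and on the decoder history, the guided feature can change from step to step, matching the dynamic adjustment described in the summary.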
Keywords/Search Tags: Video Captioning, Long Short-Term Memory, Attention, Multimodal Feature Fusion, Guidance Signals