
Research On Video Captioning Based On Deep Learning

Posted on: 2024-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Yang
Full Text: PDF
GTID: 2568307082483084
Subject: Signal and Information Processing
Abstract/Summary:
Video captioning is an important problem to be solved in the development of artificial intelligence. It is a critical indicator of how far the field has advanced, and it has a diverse range of applications in human-robot interaction, security systems, and other areas. With the rapid growth of video as an information carrier, how to process such data quickly has attracted wide attention from researchers.

Existing research follows two main directions. The first is models based on traditional methods, which first detect objects and then fill the detected objects into designed templates. The second is models applying deep learning methods, which first extract visual features from the videos and then decode those features to generate the corresponding captions. Early video captioning models based on traditional methods do not scale: their complexity and computational requirements grow sharply in open-domain scenes and on large datasets. The rapid progress of deep learning methods has greatly improved the performance of video captioning models, but some aspects still call for improvement.

The purpose of this thesis is to explore solutions and improvements for video captioning based on deep learning methods by analyzing the shortcomings of existing models, and to propose models that generate higher-quality captions for practical applications. The specific work is summarized as follows:

1) A video captioning algorithm based on multimodal feature fusion and an attention mechanism is proposed. Existing encoder-decoder models based on deep learning lack further processing of the features extracted directly from convolutional neural networks, so redundant information interferes seriously with the model and the features cannot fully represent the key content of the videos. Meanwhile, text information is used insufficiently, so the model lacks important supervision when converting visual information into language. In addition, the caption-generation process of existing models is not transparent enough. To solve these problems, the proposed algorithm exploits the complementary information between different features to generate differentiated features representing different parts of speech in the captions, using the attention mechanism to remove the redundancy of the individual features. To strengthen the use of text information, the algorithm uses the part-of-speech tags of the ground-truth captions to supervise the feature-selection process, so that the generation of captions can be observed more clearly. The proposed algorithm achieves competitive results on two commonly used video captioning datasets; a minimal sketch of this kind of fusion module is given after this summary.

2) A video captioning algorithm with guidance signals based on an attention mechanism is proposed. This algorithm improves on the previous one. In the previous algorithm, the separated design of the sub-modules weakens the connection between different words. Moreover, that algorithm does not make sufficient use of the decoded text information and cannot adjust dynamically according to the previous output when generating differentiated features. To alleviate these deficiencies, the new algorithm generates dynamic guidance signals from rich visual information, text information, and historical information. With the introduction of guidance signals, the algorithm can dynamically adjust the generated features that represent words of different parts of speech. The correlation between the generated features and the decoded words becomes much stronger, which enhances the semantic consistency of the final results. The feature-selection and decoding processes of the model also benefit from the guidance signals. Experiments on two datasets show that the model achieves good results and that the addition of guidance signals improves performance considerably; a sketch of such a guidance module also follows.
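As a concrete illustration of the first algorithm's idea, the following is a minimal PyTorch sketch of attention-based multimodal feature fusion with part-of-speech supervision. All module names, dimensions, and the specific attention form are illustrative assumptions; the thesis's actual architecture is not reproduced here.

```python
# Minimal sketch of attention-based multimodal feature fusion with
# part-of-speech (POS) supervision. All names and dimensions are
# assumptions, not the thesis's actual implementation.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Fuses appearance and motion features with additive attention and
    predicts the part of speech of the next word as auxiliary supervision."""

    def __init__(self, app_dim=2048, motion_dim=1024, hidden_dim=512, num_pos=4):
        super().__init__()
        self.app_proj = nn.Linear(app_dim, hidden_dim)
        self.motion_proj = nn.Linear(motion_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim * 2, 1)        # additive attention score
        self.pos_head = nn.Linear(hidden_dim, num_pos)  # POS supervision head

    def forward(self, app_feats, motion_feats, decoder_state):
        # app_feats: (B, T, app_dim); motion_feats: (B, T, motion_dim)
        # decoder_state: (B, hidden_dim), e.g. the LSTM decoder's hidden state
        feats = torch.cat([self.app_proj(app_feats),
                           self.motion_proj(motion_feats)], dim=1)   # (B, 2T, H)
        query = decoder_state.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([feats, query], dim=-1))        # (B, 2T, 1)
        weights = torch.softmax(scores, dim=1)    # down-weights redundant frames
        fused = (weights * feats).sum(dim=1)      # (B, H) differentiated feature
        # During training, cross-entropy between pos_logits and the POS tag of
        # the ground-truth next word supervises the feature-selection process.
        pos_logits = self.pos_head(fused)
        return fused, pos_logits
```

A single attention pass here serves both to fuse the two modalities and to suppress redundancy, while the POS head gives the model the explicit text-side supervision from the ground-truth captions described in the summary.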
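For the second algorithm, the sketch below illustrates one plausible way to form a dynamic guidance signal from visual, textual, and historical information. The gating design and all names are assumptions rather than the thesis's implementation.

```python
# Minimal sketch of a dynamic guidance signal built from visual, textual,
# and historical information. The gating form is an assumption.
import torch
import torch.nn as nn

class GuidanceSignal(nn.Module):
    """Computes a gate from the fused visual feature, the previous word
    embedding, and the decoder history, then modulates the visual feature."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim * 3, dim), nn.Sigmoid())

    def forward(self, visual_feat, prev_word_emb, history_state):
        # All inputs: (B, dim). history_state summarizes previously decoded
        # words, e.g. the hidden state of an LSTM language decoder.
        g = self.gate(torch.cat([visual_feat, prev_word_emb, history_state],
                                dim=-1))                    # (B, dim) in (0, 1)
        # Element-wise gating strengthens or suppresses parts of the visual
        # feature according to what has already been generated, tightening
        # the link between the generated features and the decoded words.
        return g * visual_feat
```

Because the gate depends on the previous word and on the decoder history, the guided feature can change from step to step, matching the dynamic adjustment described in the summary.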
Keywords/Search Tags: Video Captioning, Long Short-Term Memory, Attention, Multimodal Feature Fusion, Guidance Signals