Research On Video Retrieval Algorithm Based On Model-based Transfer Learning

Posted on:2024-03-11

Degree:Master

Type:Thesis

Country:China

Candidate:F Y Zhao

Full Text:PDF

GTID:2568307106968289

Subject:Communication engineering

Abstract/Summary:


Because video datasets with text annotations are scarce, it is difficult to construct and train retrieval models directly for the video-text retrieval task. Transfer learning with pre-trained models is therefore introduced to reduce the amount of data required. However, the pre-training task of the language model is only weakly correlated with the retrieval task, so previous work could not represent text effectively. This thesis therefore further studies the transfer problem of text encoders in video-text retrieval based on multimodal features. The main content and innovations are as follows:

First, an analysis of MMT (Multi-modal Transformer) found an input mismatch between the pre-training task of the BERT (Bidirectional Encoder Representations from Transformers) model used in MMT and the video-text retrieval task. To solve this problem, a video-text retrieval algorithm based on the CLIP (Contrastive Language-Image Pre-training) text encoder is proposed. Compared with BERT, CLIP was trained on a large amount of text-image data, a setting similar to video retrieval, so the input of the CLIP text encoder is a better match. To make full use of negative samples, the max-margin ranking loss in MMT is replaced by a symmetric cross-entropy loss. To reduce the negative impact of randomly initialized parameters, a two-step training scheme is designed: the CLIP text encoder is frozen in the first training step and fine-tuned in the second. In addition, the learning rate decay strategy is discussed in order to train the model better. Experimental results show that the proposed algorithm outperforms the reference algorithm on text-to-video retrieval on the MSRVTT dataset, with R@1, R@5 and R@10 increasing by 3.7%, 4.3% and 4.2%, respectively.

Second, to further reduce the impact of randomly initialized parameters, a video-text retrieval algorithm based on NetVLAD is proposed. The NetVLAD network has fewer parameters and a stronger ability to fuse local features, so the proposed algorithm uses NetVLAD to fuse the multimodal features. At the same time, a gated embedding module is introduced in the output layer to improve the video representation. In addition, to avoid the one-way optimum match that occurs in contrastive methods, DSL (Dual Softmax Loss) is introduced. Experimental results show that the proposed algorithm outperforms the reference model on text-to-video retrieval on the MSRVTT dataset, with R@1, R@5 and R@10 increasing by 5.5%, 6.3% and 5.7%, respectively...
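
The abstract names two training losses and the R@K metric without giving formulas. The following is a minimal sketch of how these are commonly formulated, not the thesis's own implementation. It assumes a square similarity matrix in which sim[i][j] scores text i against video j and matched pairs lie on the diagonal. The function names, the scale and prior_scale parameters, and the choice of a separate prior per retrieval direction in the dual softmax loss are all assumptions made for illustration.

```python
import math


def log_softmax(scores, scale=1.0):
    """Numerically stable log-softmax over one list of scores."""
    scaled = [scale * s for s in scores]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def softmax(scores, scale=1.0):
    return [math.exp(v) for v in log_softmax(scores, scale)]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def diagonal_cross_entropy(rows, scale):
    """Mean cross-entropy where the target of row i is index i."""
    return -sum(log_softmax(r, scale)[i] for i, r in enumerate(rows)) / len(rows)


def symmetric_cross_entropy(sim, scale=1.0):
    """Average of the text-to-video and video-to-text cross-entropy.

    Every off-diagonal entry in a row or column acts as a negative sample.
    """
    t2v = diagonal_cross_entropy(sim, scale)
    v2t = diagonal_cross_entropy(transpose(sim), scale)
    return 0.5 * (t2v + v2t)


def dual_softmax_loss(sim, scale=1.0, prior_scale=1.0):
    """Symmetric cross-entropy on scores reweighted by a cross-direction prior.

    Assumed formulation: each direction is multiplied by a softmax prior
    taken along the opposite axis before its own softmax is applied.
    """
    n = len(sim)
    # Prior for text-to-video: softmax over texts, down each column.
    col_prior = transpose([softmax(c, prior_scale) for c in transpose(sim)])
    # Prior for video-to-text: softmax over videos, along each row.
    row_prior = [softmax(r, prior_scale) for r in sim]
    t2v_scores = [[s * p for s, p in zip(sim[i], col_prior[i])] for i in range(n)]
    v2t_scores = transpose(
        [[s * p for s, p in zip(sim[i], row_prior[i])] for i in range(n)]
    )
    t2v = diagonal_cross_entropy(t2v_scores, scale)
    v2t = diagonal_cross_entropy(v2t_scores, scale)
    return 0.5 * (t2v + v2t)


def recall_at_k(sim, k):
    """Fraction of text queries whose matched video ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of videos scored strictly higher than the match.
        rank = sum(1 for score in row if score > row[i])
        hits += 1 if rank < k else 0
    return hits / len(sim)
```

With a toy matrix such as [[0.9, 0.1, 0.2], [0.2, 0.8, 0.1], [0.3, 0.7, 0.4]], recall_at_k counts the third query as a miss at k=1 because a non-matching video scores higher than the matched one. In practice both losses would be computed on batched tensors in a deep learning framework; the list-based version here only makes the arithmetic explicit.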

Keywords/Search Tags:

Cross-modality, Video text retrieval, Transfer learning, NetVLAD, CLIP


Related items

1	Study Of Content-Based Video Clip Retrieval
2	Research On DNN-based Cross-Modality Media Analysis
3	Research On Cross-modality Person Re-identification Based On Deep Learning
4	Research On Cross Modal Image And Text Retrieval Methods Based On Pretraining Model
5	Study Of Content-Based Video Clip Retrieval
6	Design And Implementation Of Cross-modal Retrieval For Video
7	Deep Learning Based Video-Text Cross-Modal Retrieval
8	Research On Cross-modality Re-identification Based On Deep Learning
9	Research On Cross-Modality Person Re-Identification Algorithm Based On Deep Learning
10	Research On Video Text Retrieval Algorithm Based On Relational Network