Research On Video Retrieval Algorithm Based On Model-based Transfer Learning

Posted on: 2024-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: F Y Zhao
GTID: 2568307106968289
Subject: Communication engineering
Abstract/Summary:
Because video datasets with text annotations are scarce, it is difficult to build and train retrieval models directly for the video-text retrieval task. Transfer learning with pre-trained models is therefore used to reduce the amount of data required. In previous work, however, the pre-training task of the language model is only weakly correlated with the retrieval task, so text cannot be represented effectively. This thesis therefore studies the transfer of text encoders in video-text retrieval based on multimodal features. The main contents and innovations are as follows:

An analysis of MMT (Multi-modal Transformer) shows an input mismatch between the pre-training task of the BERT (Bidirectional Encoder Representations from Transformers) model used in MMT and the video-text retrieval task. To solve this problem, a video-text retrieval algorithm based on the CLIP (Contrastive Language-Image Pre-training) text encoder is proposed. Unlike BERT, CLIP was trained on a large amount of text-image data, a task similar to video retrieval, and the input of the CLIP text encoder is a better match for the retrieval task. To make full use of negative samples, the max-margin ranking loss in MMT is replaced by a symmetric cross-entropy loss. To reduce the negative impact of randomly initialized parameters, a two-step training scheme is designed: the CLIP text encoder is frozen in the first step and fine-tuned in the second. A learning-rate decay strategy is also discussed in order to train the model better. Experimental results show that the proposed algorithm outperforms the reference algorithm on text-to-video retrieval on the MSRVTT dataset, with R@1, R@5 and R@10 increased by 3.7%, 4.3% and 4.2% respectively.

To further reduce the impact of randomly initialized parameters, a video-text retrieval algorithm based on NetVLAD is proposed. The NetVLAD network has fewer parameters and a stronger ability to fuse local features. The proposed algorithm uses NetVLAD to fuse the multimodal features. At the same time, a gated embedding module is introduced at the output layer to improve the video representation. In addition, DSL (Dual Softmax Loss) is introduced to avoid the one-way optimum match that occurs in contrastive methods. Experimental results show that the proposed algorithm outperforms the reference model on text-to-video retrieval on the MSRVTT dataset, with R@1, R@5 and R@10 increased by 5.5%, 6.3% and 5.7% respectively...
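
The abstract names two losses without giving their formulas. The sketch below illustrates one common formulation of each, computed on a text-by-video similarity matrix: a CLIP-style symmetric cross-entropy loss, and a dual softmax loss that reweights each score by a softmax prior taken along the opposite axis. It is not taken from the thesis. The function names, the temperature values and the equal weighting of the two retrieval directions are all assumptions made for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def scale(mat, temperature):
    return [[x / temperature for x in row] for row in mat]


def diagonal_nll(logits):
    """Mean cross entropy where the correct class of row i is column i."""
    total = sum(log_softmax(row)[i] for i, row in enumerate(logits))
    return -total / len(logits)


def symmetric_cross_entropy(sim, temperature=0.05):
    """Average of text-to-video and video-to-text cross entropy.

    sim[i][j] is the similarity of text i and video j; matched pairs
    lie on the diagonal. The temperature value is an assumption.
    """
    logits = scale(sim, temperature)
    t2v = diagonal_nll(logits)             # each text ranks all videos
    v2t = diagonal_nll(transpose(logits))  # each video ranks all texts
    return 0.5 * (t2v + v2t)


def column_prior(mat, prior_temperature):
    """prior[i][j] = softmax of column j, taken over the rows i."""
    cols = transpose(scale(mat, prior_temperature))
    prior_cols = [[math.exp(v) for v in log_softmax(col)] for col in cols]
    return transpose(prior_cols)


def dual_softmax_loss(sim, temperature=0.05, prior_temperature=1.0):
    """Cross entropy on scores reweighted by the opposite-direction prior.

    A pair scores highly only if it is a good match in both directions,
    which discourages one-way optimum matches. Both temperatures are
    assumptions.
    """
    def one_direction(mat):
        prior = column_prior(mat, prior_temperature)
        reweighted = [[s * p for s, p in zip(s_row, p_row)]
                      for s_row, p_row in zip(mat, prior)]
        return diagonal_nll(scale(reweighted, temperature))

    return 0.5 * (one_direction(sim) + one_direction(transpose(sim)))
```

Both functions take a square matrix given as a list of lists, for example `symmetric_cross_entropy([[1.0, 0.0], [0.0, 1.0]])`. In a training setup the same computation would normally be written with a tensor library so that gradients can flow to the encoders.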
Keywords/Search Tags:Cross-modality, Video text retrieval, Transfer learning, NetVLAD, CLIP