For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to encode the video content, and a task-oriented network is then trained from scratch to cope with caption generation. This paper first investigates the impact of the recently proposed CLIP (Contrastive Language-Image Pre-training) on video captioning. Through an empirical study of INP vs. CLIP, we identify the potential deficiencies of INP and explore the key factors for accurate description generation. The results show that the INP-based model struggles to capture the semantics of concepts and is sensitive to irrelevant background information. By contrast, the CLIP-based model significantly improves the caption quality and highlights the importance of concept-aware representation learning. With these findings, we propose Dual Concept Detection (DCD) to further inject concept knowledge into the model during training. DCD is an auxiliary task that requires a caption model to learn the correspondence between video content and concepts, as well as the co-occurrence relations between concepts. Experiments on MSR-VTT and VATEX demonstrate the effectiveness of DCD, and the visualization results further reveal the necessity of learning concept-aware representations.
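The abstract describes DCD only at a high level: an auxiliary multi-label objective that ties both the video representation and the caption representation to a shared concept vocabulary. The following is a minimal illustrative sketch of how such an auxiliary loss could be wired into a captioning model; the class and parameter names (`DualConceptDetector`, `lambda_dcd`), the mean-pooling of frame/token features, and the use of binary cross-entropy are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualConceptDetector(nn.Module):
    """Sketch of a dual concept-detection auxiliary head (assumed design).

    Both branches predict a multi-hot vector over a concept vocabulary:
    the video branch learns the video-to-concept correspondence, and the
    text branch learns concept co-occurrence patterns from caption features.
    """

    def __init__(self, video_dim: int, text_dim: int, num_concepts: int):
        super().__init__()
        self.video_head = nn.Linear(video_dim, num_concepts)
        self.text_head = nn.Linear(text_dim, num_concepts)

    def forward(self, video_feats, text_feats, concept_labels):
        # Mean-pool frame / token features before classification (assumption).
        video_logits = self.video_head(video_feats.mean(dim=1))
        text_logits = self.text_head(text_feats.mean(dim=1))
        # Multi-label BCE against a multi-hot concept target.
        loss_v = F.binary_cross_entropy_with_logits(video_logits, concept_labels)
        loss_t = F.binary_cross_entropy_with_logits(text_logits, concept_labels)
        return loss_v + loss_t


# Usage sketch: the auxiliary loss is added to the captioning loss during
# training, weighted by a tunable coefficient (hypothetical name lambda_dcd):
#   caption_loss = cross_entropy(logits, target_tokens)
#   dcd_loss = detector(video_feats, text_feats, concept_labels)
#   total_loss = caption_loss + lambda_dcd * dcd_loss
```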