Video captioning is a challenging task since it requires generating sentences that describe diverse and complex videos. Existing video captioning models lack adequate visual representations because they neglect the gap between videos and texts. To bridge this gap, in this paper, we propose CLIP4Caption, a framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM). The framework takes full advantage of information from both vision and language and forces the model to learn strongly text-correlated video features for text generation. Moreover, unlike most existing models that use an LSTM or GRU as the sentence decoder, we adopt a Transformer-structured decoder network to effectively learn long-range visual and language dependencies. Additionally, we introduce a novel ensemble strategy for the captioning task. Experimental results demonstrate the effectiveness of our method on two datasets: 1) on the MSR-VTT dataset, our method achieved a new state-of-the-art result with a significant gain of up to 10% in CIDEr; 2) on the private test data, our method ranked 2nd place in the ACM MM 2021 grand challenge: Pre-training for Video Understanding Challenge. Notably, our model is trained only on the MSR-VTT dataset.
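To make the video-text matching (VTM) idea concrete, the sketch below shows a generic CLIP-style contrastive objective over paired video and text embeddings. It is a minimal illustration, not the authors' exact implementation: the function name `vtm_contrastive_loss`, the pooling choice, the embedding dimension, and the `temperature` value are all illustrative assumptions.

```python
# Minimal sketch of a CLIP-style video-text matching (VTM) objective.
# Assumes video/text features have already been encoded and pooled
# (e.g., mean-pooled CLIP frame features and CLIP sentence features).
import torch
import torch.nn.functional as F


def vtm_contrastive_loss(video_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (video, text) embeddings.

    video_emb, text_emb: (batch, dim) pooled features; matching pairs share
    the same row index, so the diagonal of the similarity matrix is positive.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine-similarity logits scaled by a temperature (illustrative value).
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_v2t = F.cross_entropy(logits, targets)       # video -> text retrieval
    loss_t2v = F.cross_entropy(logits.t(), targets)   # text -> video retrieval
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    # Toy usage: a batch of 8 video/text pairs with 512-d embeddings.
    v = torch.randn(8, 512)
    t = torch.randn(8, 512)
    print(vtm_contrastive_loss(v, t).item())
```

Pre-training with such a matching loss encourages the video features fed to the Transformer decoder to be strongly correlated with text, which is the motivation for the VTM stage described above.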