Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing a remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline to efficiently adapt the pre-trained I-VL model, and exploit its powerful ability for resource-hungry video understanding tasks, with minimal training. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 10 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve competitive or state-of-the-art performance compared to existing methods, despite optimising significantly fewer parameters.
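To make the two adaptation components concrete, the sketch below illustrates the general idea in PyTorch: a handful of learnable continuous prompt vectors wrapped around class-name token embeddings that are fed to a frozen pre-trained text encoder, and a lightweight temporal Transformer stacked on frozen frame-wise visual features. This is a minimal illustrative sketch, not the authors' implementation; the module names, dimensions, the dummy text encoder, and the mean-pooling readout are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptedTextHead(nn.Module):
    """Learn a few continuous prompt vectors; the pre-trained text encoder stays frozen."""

    def __init__(self, text_encoder: nn.Module, embed_dim: int = 512,
                 n_prefix: int = 8, n_suffix: int = 8):
        super().__init__()
        self.text_encoder = text_encoder              # frozen, pre-trained (assumed interface)
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        # Randomly initialised continuous prompt vectors, the only trainable text-side parameters.
        self.prefix = nn.Parameter(0.02 * torch.randn(n_prefix, embed_dim))
        self.suffix = nn.Parameter(0.02 * torch.randn(n_suffix, embed_dim))

    def forward(self, class_token_embeds: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (num_classes, n_tokens, embed_dim) token embeddings of class names.
        b = class_token_embeds.size(0)
        prompted = torch.cat([
            self.prefix.unsqueeze(0).expand(b, -1, -1),
            class_token_embeds,
            self.suffix.unsqueeze(0).expand(b, -1, -1),
        ], dim=1)
        return self.text_encoder(prompted)            # (num_classes, embed_dim)


class TemporalPooling(nn.Module):
    """Lightweight Transformer over frame features to inject temporal context."""

    def __init__(self, embed_dim: int = 512, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, n_frames, embed_dim) from the frozen image encoder.
        return self.encoder(frame_feats).mean(dim=1)  # video-level feature


class DummyTextEncoder(nn.Module):
    """Placeholder standing in for the frozen pre-trained text encoder."""

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        return token_embeds.mean(dim=1)


if __name__ == "__main__":
    text_head = PromptedTextHead(DummyTextEncoder())
    temporal = TemporalPooling()
    class_embeds = torch.randn(10, 4, 512)            # 10 class names, 4 tokens each (dummy)
    frame_feats = torch.randn(2, 16, 512)             # 2 videos, 16 frames each (dummy)
    text_feat = F.normalize(text_head(class_embeds), dim=-1)
    video_feat = F.normalize(temporal(frame_feats), dim=-1)
    logits = video_feat @ text_feat.t()               # (2, 10) video-to-class similarities
    print(logits.shape)
```

In this framing only the prompt vectors and the small temporal Transformer are optimised, which is consistent with the abstract's claim of training significantly fewer parameters than full fine-tuning.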