Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning temporal aggregation that outperforms mean-pooling of the image-level representations extracted per frame by CLIP. We find that the simple yet effective baseline of a query-scored weighted mean of frame embeddings is a significant improvement over mean-pooling and all prior attempts at temporal modelling. In doing so, we provide an improved baseline for others to compare to, and demonstrate the state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
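To make the baseline concrete, the sketch below illustrates query-scored weighted-mean pooling under stated assumptions: per-frame CLIP image embeddings are scored against the CLIP text embedding of the query, the scores are softmax-normalised, and the resulting weights replace the uniform weights of plain mean-pooling. The function name, the temperature value, and the normalisation choices are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def query_scored_pooling(frame_embs: torch.Tensor,
                         text_emb: torch.Tensor,
                         temperature: float = 0.01) -> torch.Tensor:
    """Weighted mean of per-frame CLIP embeddings, weighted by each frame's
    similarity to the text query (query-scoring).

    frame_embs: (num_frames, dim) image embeddings, one per sampled frame.
    text_emb:   (dim,) text embedding of the query.
    Returns a single (dim,) video-level embedding.
    """
    # L2-normalise so that dot products are cosine similarities, as in CLIP.
    frame_embs = F.normalize(frame_embs, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Score each frame against the query; softmax turns scores into weights.
    # (temperature is a hyperparameter assumed here for illustration.)
    scores = frame_embs @ text_emb                      # (num_frames,)
    weights = torch.softmax(scores / temperature, dim=0)

    # Weighted mean of the frame embeddings; uniform weights would recover
    # the plain mean-pooling baseline.
    return (weights.unsqueeze(-1) * frame_embs).sum(dim=0)
```

As the temperature grows, the weights approach uniform and the method reduces to mean-pooling; as it shrinks, the pooled embedding concentrates on the frames most relevant to the query.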