Searching troves of videos with textual descriptions is a core multimodal retrieval task. Owing to the lack of a purpose-built dataset for text-to-video retrieval, video captioning datasets have been re-purposed to evaluate models by (1) treating captions as positive matches to their respective videos and (2) assuming all other videos to be negatives. However, this methodology leads to a fundamental flaw during evaluation: since each caption is marked as relevant only to its original video, alternate videos that also match the caption are scored as irrelevant, introducing false-negative caption-video pairs. We show that when these false negatives are corrected, a recent state-of-the-art model gains 25\% recall points -- a difference that threatens the validity of the benchmark itself. To diagnose and mitigate this issue, we annotate and release 683K additional caption-video pairs. Using these, we recompute effectiveness scores for three models on two standard benchmarks (MSR-VTT and MSVD). We find that (1) the recomputed metrics are up to 25\% recall points higher for the best models, (2) these benchmarks are nearing saturation for Recall@10, (3) caption length (a proxy for generality) is related to the number of positives, and (4) annotation costs can be mitigated through sampling. We recommend retiring these benchmarks in their current form, and we make recommendations for future text-to-video retrieval benchmarks.
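As a minimal sketch (not the authors' evaluation code), the Python snippet below illustrates the scoring effect described above: how Recall@K changes when a caption's positive set is expanded from the single original video to the corrected multi-positive set. All names, video IDs, and captions here are hypothetical.

```python
# Minimal sketch of Recall@K under single-positive vs. corrected
# multi-positive labels. `ranked_videos` maps each caption to the
# system's ranked list of video IDs; `positives` maps each caption
# to its set of relevant video IDs.

def recall_at_k(ranked_videos, positives, k=10):
    """Fraction of captions with at least one relevant video in the top k."""
    hits = sum(
        1 for cap, ranking in ranked_videos.items()
        if positives[cap] & set(ranking[:k])
    )
    return hits / len(ranked_videos)

# Toy example: the model ranks a "false negative" (video_b) above the
# original positive (video_a). With single-positive labels Recall@1 is 0;
# with the corrected multi-positive labels it is 1.
ranked = {"a man cooks pasta": ["video_b", "video_a", "video_c"]}
single = {"a man cooks pasta": {"video_a"}}
multi  = {"a man cooks pasta": {"video_a", "video_b"}}

print(recall_at_k(ranked, single, k=1))  # 0.0 under the original labels
print(recall_at_k(ranked, multi,  k=1))  # 1.0 after correcting false negatives
```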