For much of the machine learning community, the expense of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks to building models for new tasks. We propose a simple zero-shot approach for one such task, Video Moment Retrieval (VMR), that performs no additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching, and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we vastly improve over previous zero-shot approaches, by at least 2.5x on all metrics, and reduce the gap between zero-shot and state-of-the-art supervised performance by over 74%. Further, we show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on the mAP metrics, and that it also outperforms the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose interesting future directions.