We introduce S$^2$VS, a video similarity learning approach with self-supervision. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so as to have strong transferability on target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and address multiple retrieval and detection tasks at once with no use of labeled data. This is achieved by learning via instance-discrimination with task-tailored augmentations and the widely used InfoNCE loss together with an additional loss operating jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined with varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data. The code and pretrained models are publicly available at: \url{https://github.com/gkordo/s2vs}
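As a concrete illustration of the instance-discrimination objective named above, here is a minimal NumPy sketch of the InfoNCE loss over a precomputed similarity matrix. The function name, the temperature value, and the assumption that positives lie on the diagonal are illustrative choices, not details taken from the paper's implementation (which also adds a loss on self-similarity and hard-negative similarity).

```python
import numpy as np

def info_nce(sim, tau=0.07):
    """Illustrative InfoNCE sketch (not the authors' code).

    sim[i, j] is the similarity between anchor i and candidate j;
    the diagonal entries are assumed to be the positive pairs.
    tau is the softmax temperature.
    """
    logits = sim / tau
    # Row-wise log-partition, computed stably.
    m = logits.max(axis=1, keepdims=True)
    logz = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    # Log-probability assigned to each positive (diagonal) pair.
    log_prob_pos = np.diag(logits - logz)
    # Loss is the mean negative log-likelihood of the positives.
    return -log_prob_pos.mean()

# Toy example: two anchors, positives on the diagonal.
sim = np.array([[0.9, 0.1],
                [0.2, 0.8]])
loss = info_nce(sim)
```

A well-trained embedding pushes diagonal similarities above off-diagonal ones, which drives this loss toward zero; a uniform similarity matrix yields the chance-level loss log(N).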