The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While few-shot learning has proven effective, labelling even a few examples is unrealistic in many practical video settings. This is especially true as the level of detail in spatio-temporal video understanding, and with it the complexity of annotations, continues to increase. Rather than performing few-shot learning with a human oracle that provides a few densely labelled support videos, we propose to automatically learn to find appropriate support videos given a query. We call this self-shot learning, and we outline a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples. To showcase this novel setting, we tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting, where the goal is to segment instances at the pixel level across the spatial and temporal domains. We provide strong baselines built on a novel transformer-based model and show that self-shot learning can even surpass few-shot learning, and that the two can be combined for further performance gains. Experiments on new benchmarks show that our approach achieves strong performance, is competitive with oracle support in some settings, scales to large unlabelled video collections, and can be integrated in a semi-supervised setting.
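To make the self-shot retrieval step concrete, the sketch below shows one way a query video could select its own support set: nearest-neighbour search over a pool of unlabelled videos in a learned embedding space. The encoder, its self-supervised training objective, and the function `retrieve_support` are all assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of self-shot support retrieval, assuming all videos have
# already been mapped to L2-normalised embeddings by a self-supervised
# video encoder (hypothetical; not the paper's model).
import numpy as np

def retrieve_support(query_emb: np.ndarray,
                     pool_embs: np.ndarray,
                     k: int = 5) -> np.ndarray:
    """Return indices of the k unlabelled pool videos closest to the query.

    query_emb: (d,) L2-normalised embedding of the query video.
    pool_embs: (n, d) L2-normalised embeddings of the unlabelled pool.
    """
    sims = pool_embs @ query_emb       # cosine similarity to each pool video
    return np.argsort(-sims)[:k]       # top-k most similar form the support set

# Usage with synthetic embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 128))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
query = pool[42] + 0.1 * rng.normal(size=128)   # a query resembling pool item 42
query /= np.linalg.norm(query)
support_idx = retrieve_support(query, pool, k=5)
print(support_idx)  # indices of the retrieved self-shot support videos
```

Because retrieval is a single matrix-vector product over precomputed embeddings, this step scales to large unlabelled collections, consistent with the scaling claim in the abstract.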