Few-shot video classification aims to learn new video categories with only a few labeled examples, alleviating the burden of costly annotation in real-world applications. However, it is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting. To address this, we propose a novel matching-based few-shot learning strategy for video sequences in this work. Our main idea is to introduce an implicit temporal alignment for a video pair, capable of estimating the similarity between them in an accurate and robust manner. Moreover, we design an effective context encoding module to incorporate spatial and feature channel context, resulting in better modeling of intra-class variations. To train our model, we develop a multi-task loss for learning video matching, leading to video features with better generalization. Extensive experimental results on two challenging benchmarks show that our method outperforms prior arts by a sizable margin on Something-Something-V2 and achieves competitive results on Kinetics.
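To make the idea of implicit temporal alignment concrete, the following is a minimal sketch of how a soft, alignment-free similarity between two videos could be computed from per-frame features. It is an illustrative approximation only: the cosine similarity, softmax temperature, and mean aggregation are assumptions, not the exact formulation used in our method.

```python
# Minimal sketch: implicit temporal alignment similarity between two videos,
# assuming each video is given as a matrix of per-frame feature vectors.
# All design choices below (cosine similarity, temperature, mean pooling)
# are illustrative assumptions, not the paper's exact formulation.
import numpy as np

def frame_similarity(query, support):
    """Cosine similarity matrix between query frames (Tq x D) and support frames (Ts x D)."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    return q @ s.T  # shape (Tq, Ts)

def implicit_alignment_score(query, support, temperature=0.1):
    """Video-to-video similarity via soft (implicit) temporal alignment.

    Each query frame attends to all support frames; the attention-weighted
    similarity replaces a hard frame-to-frame matching, so no explicit
    alignment path is ever computed.
    """
    sim = frame_similarity(query, support)            # (Tq, Ts)
    attn = np.exp(sim / temperature)
    attn /= attn.sum(axis=1, keepdims=True)           # soft alignment per query frame
    per_frame = (attn * sim).sum(axis=1)              # expected similarity per query frame
    return per_frame.mean()                           # aggregate over the query video

# Usage: two clips with 8 frames of 512-d features each (hypothetical shapes).
rng = np.random.default_rng(0)
q_feats = rng.standard_normal((8, 512))
s_feats = rng.standard_normal((8, 512))
print(implicit_alignment_score(q_feats, s_feats))
```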