Few-shot action recognition aims to recognize novel action classes in query videos using just a few support samples. Most current approaches follow the metric learning paradigm, which learns to compare the similarity between videos. Recently, it has been observed that directly measuring this similarity is not ideal, since different action instances may show distinct temporal distributions, resulting in severe misalignment between query and support videos. In this paper, we approach this problem from two distinct aspects: action duration misalignment and action evolution misalignment. We address them sequentially through a Two-stage Action Alignment Network (TA2N). The first stage locates the action by learning a temporal affine transform, which warps each video feature to its action duration while discarding action-irrelevant features (e.g., background). The second stage then coordinates the query feature to match the spatial-temporal action evolution of the support by performing temporal rearrangement and spatial offset prediction. Extensive experiments on benchmark datasets show the potential of the proposed method in achieving state-of-the-art performance for few-shot action recognition.
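To make the first stage more concrete, the following is a minimal sketch of what a temporal affine warp over frame features could look like. It is not the TA2N implementation: the function name `temporal_affine_warp`, the `scale` and `shift` parameterization over normalized time in [-1, 1], and the use of linear interpolation are all assumptions made for illustration. In the paper the transform is learned, whereas here the parameters are passed in by hand.

```python
import numpy as np


def temporal_affine_warp(features, scale, shift):
    """Resample a (T, D) frame-feature sequence along the time axis.

    Hypothetical helper: output frame i reads from the normalized source
    position scale * t_i + shift, where t_i is evenly spaced in [-1, 1].
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]

    # Normalized output grid in [-1, 1].
    grid = np.linspace(-1.0, 1.0, num_frames)

    # Affine map into source coordinates, clamped to the valid range.
    source = np.clip(scale * grid + shift, -1.0, 1.0)

    # Convert normalized coordinates to fractional frame indices.
    index = (source + 1.0) * 0.5 * (num_frames - 1)
    lower = np.floor(index).astype(int)
    upper = np.minimum(lower + 1, num_frames - 1)
    weight = (index - lower)[:, None]

    # Linear interpolation between the two neighboring frames.
    return (1.0 - weight) * features[lower] + weight * features[upper]


# Toy input: 5 frames with a 1-dimensional feature equal to the frame index.
frames = np.arange(5.0).reshape(5, 1)

# scale=0.5 zooms into the middle half of the clip (frames 1 to 3).
zoomed = temporal_affine_warp(frames, scale=0.5, shift=0.0)
```

With `scale=1.0` and `shift=0.0` the warp is the identity. A smaller `scale` narrows the sampled window, which is one way to model cropping a video to its action duration and leaving out background frames at either end.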