We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.
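For illustration, the sketch below shows one way the described mechanism could be realised: per-frame features are combined into ordered frame pairs, and a CrossTransformer-style attention over all support tuples of a class builds a query-specific prototype. This is a minimal sketch under assumed shapes and projections, not the authors' released code; the helper names `frame_pairs` and `CrossTransformerSketch` and the chosen dimensions are illustrative assumptions.

```python
# Minimal sketch (assumed shapes/names, not the official TRX implementation)
# of CrossTransformer attention over ordered frame pairs.
import torch
import torch.nn.functional as F
from itertools import combinations


def frame_pairs(features):
    """Stack features of all ordered frame pairs.

    features: (num_frames, dim) per-frame features of one video.
    returns:  (num_pairs, 2 * dim) one embedding per ordered pair (i < j).
    """
    idx = list(combinations(range(features.shape[0]), 2))
    return torch.stack([torch.cat([features[i], features[j]]) for i, j in idx])


class CrossTransformerSketch(torch.nn.Module):
    def __init__(self, tuple_dim, proj_dim):
        super().__init__()
        self.to_qk = torch.nn.Linear(tuple_dim, proj_dim)  # query/key projection
        self.to_v = torch.nn.Linear(tuple_dim, proj_dim)   # value projection

    def forward(self, query_tuples, support_tuples):
        """query_tuples:   (Q, tuple_dim) tuples from the query video.
        support_tuples: (S, tuple_dim) tuples pooled over all support videos of one class.
        Returns the mean distance between query tuples and their query-specific prototypes."""
        q = self.to_qk(query_tuples)    # (Q, proj_dim)
        k = self.to_qk(support_tuples)  # (S, proj_dim)
        v = self.to_v(support_tuples)   # (S, proj_dim)
        # Each query tuple attends over every support tuple of the class,
        # rather than over a class average or a single best-matching video.
        attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)  # (Q, S)
        prototype = attn @ v            # query-specific class prototype
        return torch.norm(self.to_v(query_tuples) - prototype, dim=-1).mean()


# Usage: one class of a 5-shot episode, 8 frames per video, 2048-d frame features.
frames_q = torch.randn(8, 2048)                       # query video features
support = [torch.randn(8, 2048) for _ in range(5)]    # support videos of one class
model = CrossTransformerSketch(tuple_dim=2 * 2048, proj_dim=128)
dist = model(frame_pairs(frames_q), torch.cat([frame_pairs(s) for s in support]))
```

In the full method this distance would be computed for every class in the episode (and for tuples of several cardinalities, e.g. pairs and triples), with the query assigned to the closest class.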