We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot action recognition works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers achieve state-of-the-art results on both Kinetics and Something-Something V2 (SSv2), outperforming prior work on SSv2 by a wide margin (6.8%) due to the method's ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers. Code is available at https://github.com/tobyperrett/trx
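To make the core idea concrete, below is a minimal sketch (not the authors' implementation; see the repository linked above) of how query-specific class prototypes could be built with cross-attention over ordered frame tuples, shown for a single tuple cardinality (frame pairs). All names, dimensions, and the plain linear projections are illustrative assumptions; the actual TRX model combines multiple cardinalities and trained projection layers.

```python
# Minimal sketch, assuming PyTorch. Hypothetical names (frame_pairs, class_prototype,
# w_q/w_k/w_v) are for illustration only, not the released TRX code.
import itertools
import torch
import torch.nn.functional as F

def frame_pairs(features):
    # features: (num_frames, dim) per-frame embeddings of one video.
    # Returns (num_pairs, 2*dim): concatenated ordered frame pairs (i < j).
    idx = list(itertools.combinations(range(features.shape[0]), 2))
    return torch.stack([torch.cat([features[i], features[j]]) for i, j in idx])

def class_prototype(query_frames, support_frames_list, w_q, w_k, w_v):
    # query_frames: (T, dim); support_frames_list: list of (T, dim), one per support video.
    # Each query tuple attends over the tuples of ALL support videos of a class,
    # rather than a class average or a single best-matching video.
    q = frame_pairs(query_frames) @ w_q                        # (Nq, d_att)
    support_tuples = torch.cat([frame_pairs(s) for s in support_frames_list])
    k = support_tuples @ w_k                                   # (Ns_total, d_att)
    v = support_tuples @ w_v                                   # (Ns_total, d_att)
    attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)   # (Nq, Ns_total)
    return attn @ v                                            # one prototype row per query tuple

# Toy usage: 8-frame videos, 2048-d frame features, a 2-shot support set.
dim, d_att = 2048, 128
w_q = torch.randn(2 * dim, d_att) * 0.01
w_k = torch.randn(2 * dim, d_att) * 0.01
w_v = torch.randn(2 * dim, d_att) * 0.01
proto = class_prototype(torch.randn(8, dim),
                        [torch.randn(8, dim) for _ in range(2)],
                        w_q, w_k, w_v)
print(proto.shape)  # torch.Size([28, 128]) -- C(8, 2) = 28 query frame pairs
```

In the paper's formulation this construction is repeated for tuples of several lengths (the temporal-relational part), so that sub-sequences at different speeds and temporal offsets can be matched; the sketch above shows only the pairwise case.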