We present a general framework for compositional action recognition, i.e., action recognition where the labels are composed of simpler components such as subjects, atomic actions, and objects. The main challenge in compositional action recognition is that the basic components can be combined into a combinatorially large set of possible actions. However, compositionality also provides a structure that can be exploited. To do so, we develop and test a novel Structured Attention Fusion (SAF) self-attention mechanism that combines information from object detections, which capture the time-series structure of an action, with visual cues that capture contextual information. We show that our approach recognizes novel verb-noun compositions more effectively than current state-of-the-art systems, and that it generalizes efficiently to unseen action categories from only a few labeled examples. We validate our approach on the challenging Something-Else tasks from the Something-Something-V2 dataset. We further show that our framework is flexible and can generalize to a new domain by demonstrating competitive results on the Charades-Fewshot dataset.