In this work we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view of analyzing actions along the trajectory of a person's motion rather than at a fixed point in space. This view allows us to use tracklets of people to predict their actions. In this spirit, we first show the benefits of using 3D pose to infer actions and to study person-person interactions. Subsequently, we propose a Lagrangian Action Recognition model that fuses 3D pose and contextualized appearance over tracklets. Our method achieves state-of-the-art performance on the AVA v2.2 dataset in both the pose-only setting and the standard benchmark setting. When reasoning about actions using only pose cues, our pose model achieves a +10.0 mAP gain over the corresponding state of the art, while our fused model achieves a +2.8 mAP gain over the best state-of-the-art model. Code and results are available at: https://brjathu.github.io/LART
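The fusion idea above — combining per-frame 3D pose and contextualized appearance features for a tracked person, then pooling over the tracklet before classification — can be sketched minimally as follows. This is a toy illustration only: the actual model uses a transformer over tracklet tokens, and all function names and feature dimensions here are hypothetical.

```python
from typing import List

def fuse_tracklet(pose_feats: List[List[float]],
                  appearance_feats: List[List[float]]) -> List[float]:
    """Concatenate pose and appearance features per frame, then
    average-pool over the tracklet's frames. A hypothetical stand-in
    for the transformer-based fusion described in the abstract."""
    assert len(pose_feats) == len(appearance_feats), "one pair per frame"
    # Per-frame concatenation of the two modalities.
    fused = [p + a for p, a in zip(pose_feats, appearance_feats)]
    # Temporal average pooling over the tracklet.
    dim = len(fused[0])
    return [sum(f[i] for f in fused) / len(fused) for i in range(dim)]

# Toy tracklet: 2 frames, pose dim 2, appearance dim 2.
pose = [[1.0, 0.0], [0.0, 1.0]]
app = [[0.5, 0.5], [0.5, 0.5]]
print(fuse_tracklet(pose, app))  # [0.5, 0.5, 0.5, 0.5]
```

The key design point this illustrates is the Lagrangian framing: features are aggregated along one person's trajectory over time, rather than over a fixed spatial location in the video.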