Human activity recognition in videos has been widely studied and has recently seen significant advances with deep learning approaches; however, it remains a challenging task. In this paper, we propose a novel framework that simultaneously considers both implicit and explicit representations of human interactions by fusing information from the local image region where the interaction actively occurs, the primitive motion and posture of each subject's body parts, and the co-occurrence of overall appearance change. Human interactions vary depending on how the body parts of each person interact with those of the other. The proposed method captures the subtle differences between interactions using interacting body part attention: semantically important body parts that interact with other objects receive greater weight during feature representation. The combined feature of the attention-based individual representation and the co-occurrence descriptor of full-body appearance change is fed into a long short-term memory (LSTM) network to model the temporal dynamics over time within a single framework. We validate the effectiveness of the proposed method on four widely used public datasets, on which it outperforms competing state-of-the-art methods.
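To make the described pipeline concrete, the following is a minimal PyTorch sketch of how such an architecture could be assembled. This is a hypothetical illustration, not the authors' implementation: the module name `BodyPartAttentionFusion`, all feature dimensions, and the single-layer attention scorer are assumptions; only the overall structure (softmax attention over per-body-part features, concatenation with a global co-occurrence descriptor, and an LSTM over time) follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BodyPartAttentionFusion(nn.Module):
    """Hypothetical sketch: attention over per-body-part features, fused
    with a global co-occurrence descriptor, then an LSTM over time."""

    def __init__(self, part_dim, global_dim, hidden_dim, num_classes):
        super().__init__()
        # Scores each body-part feature; softmax over parts yields attention.
        self.attn_score = nn.Linear(part_dim, 1)
        self.lstm = nn.LSTM(part_dim + global_dim, hidden_dim,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, part_feats, global_feats):
        # part_feats:   (batch, time, num_parts, part_dim)
        # global_feats: (batch, time, global_dim) co-occurrence descriptor
        scores = self.attn_score(part_feats)          # (B, T, P, 1)
        weights = F.softmax(scores, dim=2)            # attention over parts
        attended = (weights * part_feats).sum(dim=2)  # (B, T, part_dim)
        fused = torch.cat([attended, global_feats], dim=-1)
        out, _ = self.lstm(fused)                     # temporal dynamics
        return self.classifier(out[:, -1])            # class logits


# Example usage with arbitrary (assumed) dimensions:
model = BodyPartAttentionFusion(part_dim=256, global_dim=512,
                                hidden_dim=512, num_classes=8)
logits = model(torch.randn(2, 16, 10, 256), torch.randn(2, 16, 512))
```

In this sketch the attention weights are normalized across body parts at each time step, so parts that interact with other objects can dominate the pooled representation, mirroring the weighting described above.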