Compared with the progress made on human activity classification, much less success has been achieved on human interaction understanding (HIU). Apart from the fact that the latter task is much more challenging, the main cause is that recent approaches learn human interactive relations via shallow graphical representations, which are inadequate for modeling complicated human interactions. In this paper, we propose a deep logic-aware graph network, which combines the representational ability of graph attention with the rigor of logical reasoning to facilitate human interaction understanding. Our network consists of three components: a backbone CNN to extract image features, a graph network to learn interactive relations among participants, and a logic-aware reasoning module. Our key observation is that the first-order logic for HIU can be embedded into higher-order energy functions, whose minimization delivers logic-aware predictions. We propose an efficient mean-field inference algorithm so that all modules of our network can be trained jointly in an end-to-end way. Experimental results show that our approach achieves leading performance on three existing benchmarks and on a new, challenging dataset that we constructed. Code will be publicly available.
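To make the idea of minimizing an energy with mean-field inference concrete, the following is a minimal sketch of generic mean-field updates on a toy pairwise energy. It is not the algorithm of this paper, whose energy functions are higher-order and whose details the abstract does not give. The function names `softmax_neg` and `mean_field`, the two labels, the single soft rule ("interacting people should share the same action"), and all energy values are assumptions made purely for illustration.

```python
import math


def softmax_neg(energies):
    """Return a distribution proportional to exp(-energy)."""
    m = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - m)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def mean_field(unary, pairwise, edges, num_iters=20):
    """Approximate per-variable marginals of a pairwise energy model.

    unary[i][a]    : energy of assigning label a to variable i
    pairwise[a][b] : energy of labels (a, b) on any edge (assumed symmetric)
    edges          : list of (i, j) pairs of interacting variables
    """
    n = len(unary)
    k = len(unary[0])
    neighbors = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Start from the marginals implied by the unary energies alone.
    q = [softmax_neg(u) for u in unary]
    for _ in range(num_iters):
        for i in range(n):
            energies = []
            for a in range(k):
                # Expected pairwise energy under the neighbors' current marginals.
                expected = sum(
                    q[j][b] * pairwise[a][b]
                    for j in neighbors[i]
                    for b in range(k)
                )
                energies.append(unary[i][a] + expected)
            q[i] = softmax_neg(energies)
    return q


# Hypothetical example: two people, labels 0 = "handshake", 1 = "no action".
unary = [[0.0, 3.0],   # person 0 strongly prefers label 0
         [0.5, 0.0]]   # person 1 weakly prefers label 1
# Soft rule as a penalty: interacting people should share the same label.
pairwise = [[0.0, 2.0],
            [2.0, 0.0]]
edges = [(0, 1)]

marginals = mean_field(unary, pairwise, edges)
labels = [max(range(len(q)), key=lambda a: q[a]) for q in marginals]
print(labels)  # prints [0, 0]
```

In this toy case the rule penalty overrides person 1's weak preference, so both people receive the same label. Without the edge, the same unary energies would give the labels 0 and 1. Because every update is a differentiable softmax, iterations of this kind can in principle be unrolled and trained together with the rest of a network, which is the general property that end-to-end training relies on.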