Actions are about how we interact with the environment, including other people, objects, and ourselves. In this paper, we propose a novel multi-modal Holistic Interaction Transformer Network (HIT) that leverages the largely ignored, but critical, hand and pose information essential to most human actions. The proposed HIT network is a comprehensive bi-modal framework comprising an RGB stream and a pose stream, each of which separately models person, object, and hand interactions. Within each sub-network, we introduce an Intra-Modality Aggregation (IMA) module that selectively merges individual interaction units. The resulting features from the two modalities are then combined using an Attentive Fusion Mechanism (AFM). Finally, we extract cues from the temporal context, stored in a cached memory, to better classify the occurring actions. Our method significantly outperforms previous approaches on the J-HMDB, UCF101-24, and MultiSports datasets, and achieves competitive results on AVA. The code will be available at https://github.com/joslefaure/HIT.
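To make the described two-stream design more concrete, below is a minimal sketch, assuming a PyTorch-style implementation. The module names, shapes, attention-based aggregation, and gated fusion are illustrative assumptions, not the authors' actual code (which the abstract says will be released at https://github.com/joslefaure/HIT); the cached temporal memory is omitted for brevity.

import torch
import torch.nn as nn


class IntraModalityAggregation(nn.Module):
    """Selectively merges person/object/hand interaction units within one
    modality via attention (assumed design, not the paper's exact IMA)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, person, obj, hand):
        # Stack the interaction units as a short token sequence and let
        # the person feature attend over all units.
        units = torch.stack([person, obj, hand], dim=1)  # (B, 3, D)
        query = person.unsqueeze(1)                      # (B, 1, D)
        merged, _ = self.attn(query, units, units)
        return self.norm(merged.squeeze(1) + person)


class AttentiveFusion(nn.Module):
    """Combines RGB and pose features with learned attention weights
    (a simple gating stand-in for the paper's AFM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, rgb_feat, pose_feat):
        w = self.gate(torch.cat([rgb_feat, pose_feat], dim=-1))  # (B, 2)
        return w[:, :1] * rgb_feat + w[:, 1:] * pose_feat


class HITSketch(nn.Module):
    """Toy end-to-end wrapper: per-modality aggregation, attentive fusion,
    then an action classifier."""

    def __init__(self, dim: int = 256, num_classes: int = 21):
        super().__init__()
        self.rgb_ima = IntraModalityAggregation(dim)
        self.pose_ima = IntraModalityAggregation(dim)
        self.fusion = AttentiveFusion(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, rgb_units, pose_units):
        rgb = self.rgb_ima(*rgb_units)
        pose = self.pose_ima(*pose_units)
        return self.classifier(self.fusion(rgb, pose))


if __name__ == "__main__":
    B, D = 2, 256
    rgb_units = tuple(torch.randn(B, D) for _ in range(3))   # person/object/hand
    pose_units = tuple(torch.randn(B, D) for _ in range(3))
    logits = HITSketch(dim=D)(rgb_units, pose_units)
    print(logits.shape)  # torch.Size([2, 21]), e.g. for J-HMDB's 21 classes

In this sketch each stream produces a single fused feature, and the gating network weights the two modalities per sample before classification; the actual HIT model operates on detected person, object, and hand regions over video clips.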