Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks cover only a fraction of these complex interactions. To close this gap, we present MMHOI -- a large-scale Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network that jointly estimates human and object 3D geometries, their interactions, and the associated actions. A key innovation of our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to strengthen interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality. The MMHOI dataset is publicly available at https://zenodo.org/records/17711786.