We present a method for inferring diverse 3D models of human-object interactions from images. Reasoning about how humans interact with objects in complex scenes from a single 2D image is a challenging task given the ambiguities arising from the loss of information through projection. In addition, modeling 3D interactions requires the ability to generalize to diverse object categories and interaction types. We propose an action-conditioned modeling of interactions that allows us to infer diverse 3D arrangements of humans and objects without supervision on contact regions or 3D scene geometry. Our method extracts high-level commonsense knowledge from large language models (such as GPT-3) and applies it to 3D reasoning about human-object interactions. Our key insight is that priors extracted from large language models can help in reasoning about human-object contacts from textual prompts alone. We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset and show that our method leads to better 3D reconstructions. We further qualitatively evaluate the effectiveness of our method on real images and demonstrate its generalizability across interaction types and object categories.
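To illustrate the kind of prompting involved, the minimal Python sketch below composes a textual prompt from an action label and an object category, queries a GPT-3-style completion function, and maps the answer onto a fixed body-part vocabulary. The `query_llm` callable, the part list, and the prompt wording are illustrative placeholders under our own assumptions, not the exact protocol used in the method.

```python
# Hypothetical sketch: extract a contact prior from a textual prompt only.
# `query_llm(prompt) -> str` stands in for any GPT-3-style completion call.

# Fixed vocabulary of coarse body parts the prior may refer to (assumed here).
BODY_PARTS = [
    "head", "torso", "left hand", "right hand",
    "left arm", "right arm", "hips",
    "left leg", "right leg", "left foot", "right foot",
]

def build_contact_prompt(action: str, obj: str) -> str:
    """Compose a textual prompt asking which body parts touch the object."""
    parts = ", ".join(BODY_PARTS)
    return (
        f"A person is {action} a {obj}. "
        f"From this list of body parts: {parts}. "
        f"Which parts are most likely touching the {obj}? "
        "Answer with a comma-separated subset of the list."
    )

def parse_contact_parts(llm_answer: str) -> set[str]:
    """Keep only answers that match the known part vocabulary."""
    mentioned = {p.strip().lower() for p in llm_answer.split(",")}
    return {p for p in BODY_PARTS if p in mentioned}

def contact_prior(action: str, obj: str, query_llm) -> set[str]:
    """Prompt the LLM and return the set of likely contact parts.

    The returned part set could then be converted into contact terms
    in a 3D human-object fitting objective.
    """
    answer = query_llm(build_contact_prompt(action, obj))
    return parse_contact_parts(answer)

if __name__ == "__main__":
    # Canned response in place of a live API call, for illustration only.
    fake_llm = lambda prompt: "right hand, left hand"
    print(sorted(contact_prior("riding", "bicycle", fake_llm)))
    # -> ['left hand', 'right hand']
```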