Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from a single image, nor a dataset from which to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts for images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas of contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons around the 2D image areas where contact takes place. We also annotate the body part involved in each contact. We use our HOT dataset to train a new contact detector that takes a single color image as input, and outputs 2D contact heatmaps as well as the labels of the body parts in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively; quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and good generalization.
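The PROX-based annotation pipeline described above (labeling contact via 3D mesh proximity, then projecting to the image) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the 2 cm distance threshold, and the pinhole-camera setup are hypothetical assumptions for the sake of the example.

```python
import numpy as np
from scipy.spatial import cKDTree

def annotate_contact_2d(human_verts, scene_verts, K, thresh=0.02):
    """Sketch of proximity-based 2D contact annotation.

    human_verts: (N, 3) human mesh vertices in camera coordinates
    scene_verts: (M, 3) scene mesh vertices in the same frame
    K:           (3, 3) pinhole camera intrinsics
    thresh:      contact distance threshold in meters (hypothetical value)
    """
    # Distance from each human vertex to its nearest scene vertex.
    dists, _ = cKDTree(scene_verts).query(human_verts)

    # Vertices closer than the threshold are labeled as in contact.
    contact = human_verts[dists < thresh]

    # Project contact vertices to pixel coordinates (assumes z > 0).
    uv = (K @ contact.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return uv  # (C, 2) pixel locations of contact points
```

The returned pixel locations could then be rasterized into the 2D contact areas (or heatmaps) that the annotation stage produces; per-vertex body-part assignments on the human mesh would likewise yield the body-part labels.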