Humans constantly interact with daily objects to accomplish tasks. To understand such interactions, computers need to reconstruct them from cameras observing whole-body interaction with scenes. This is challenging due to occlusion between the body and objects, motion blur, depth/scale ambiguities, and the low image resolution of hands and graspable object parts. To make the problem tractable, the community focuses either on interacting hands, ignoring the body, or on interacting bodies, ignoring hands. The GRAB dataset addresses dexterous whole-body interaction but uses marker-based MoCap and lacks images, while BEHAVE captures video of body-object interaction but lacks hand detail. We address the limitations of prior work with InterCap, a novel method that reconstructs interacting whole bodies and objects from multi-view RGB-D data, using the parametric whole-body model SMPL-X and known object meshes. To tackle the above challenges, InterCap uses two key observations: (i) Contact between the hand and object can be used to improve the pose estimation of both. (ii) Azure Kinect sensors allow us to set up a simple multi-view RGB-D capture system that minimizes the effect of occlusion while providing reasonable inter-camera synchronization. With this method we capture the InterCap dataset, which contains 10 subjects (5 males and 5 females) interacting with 10 objects of various sizes and affordances, including contact with the hands or feet. In total, InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images. Our method provides pseudo ground-truth body meshes and objects for each video frame. Our InterCap method and dataset fill an important gap in the literature and support many research directions. Our data and code are available for research purposes.
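To make observation (i) concrete, the sketch below illustrates one plausible form of a contact term that could be added to a body-and-object fitting objective: hand vertices that lie within a small distance of the object mesh are assumed to be in contact, and their squared distance to the object is penalized so that the optimization pulls hand and object poses into agreement. This is a minimal sketch, not the paper's implementation: the function name, the 1 cm threshold, and the brute-force nearest-vertex search are all assumptions chosen for clarity, and a real system would combine this term with data terms (2D keypoints, depth) and pose priors.

```python
import numpy as np

def contact_energy(hand_verts, object_verts, contact_thresh=0.01):
    """Hedged sketch of a hand-object contact term (names hypothetical).

    hand_verts:     (H, 3) hand vertices of the fitted SMPL-X mesh, meters
    object_verts:   (O, 3) vertices of the known object mesh in world frame
    contact_thresh: distance below which a vertex is assumed in contact
    """
    # Pairwise distances between every hand vertex and every object vertex.
    diff = hand_verts[:, None, :] - object_verts[None, :, :]  # (H, O, 3)
    dists = np.linalg.norm(diff, axis=-1)                     # (H, O)

    # Distance from each hand vertex to its nearest object vertex.
    nearest = dists.min(axis=1)                               # (H,)

    # Only vertices already near the object contribute; pulling every
    # vertex toward the object would distort the grasp.
    in_contact = nearest < contact_thresh
    return float((nearest[in_contact] ** 2).sum())

# Toy usage: a few "hand" vertices hovering 5 mm from a random "object".
rng = np.random.default_rng(0)
obj = rng.random((100, 3))
hand = obj[:5] + 0.005  # within the assumed 1 cm contact threshold
print(contact_energy(hand, obj))
```

In a full fitting pipeline this energy would be one summand of the total objective, weighted against the image-evidence terms; minimizing it jointly over body and object parameters is what lets contact improve the pose estimates of both, as the abstract states.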