Human-object interactions with articulated objects are common in everyday life. Despite much progress in single-view 3D reconstruction, it is still challenging to infer an articulated 3D object model from an RGB video showing a person manipulating the object. We canonicalize the task of articulated 3D human-object interaction reconstruction from RGB video, and carry out a systematic benchmark of five families of methods for this task: 3D plane estimation, 3D cuboid estimation, CAD model fitting, implicit field fitting, and free-form mesh fitting. Our experiments show that all methods struggle to obtain high-accuracy results, even when provided with ground-truth information about the observed objects. We identify key factors that make the task challenging and suggest directions for future work on this challenging 3D computer vision task. A short video summary is available at https://www.youtube.com/watch?v=5tAlKBojZwc