We present CARTO, a novel approach for reconstructing multiple articulated objects from a single stereo RGB observation. We use implicit object-centric representations and learn a single geometry and articulation decoder for multiple object categories. Despite being trained on multiple categories, our decoder achieves reconstruction accuracy comparable to that of methods that train a bespoke decoder for each category. Combined with our stereo image encoder, we infer the 3D shape, 6D pose, size, joint type, and joint state of multiple unknown objects in a single forward pass. Our method achieves a 20.4% absolute improvement in mAP 3D IoU50 for novel instances compared to a two-stage pipeline. Inference is fast: on an NVIDIA TITAN XP GPU it runs at 1 Hz for eight or fewer objects. Although trained only on simulated data, CARTO transfers to real-world object instances. Code and evaluation data are available at: http://carto.cs.uni-freiburg.de
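For readers unfamiliar with the IoU50 criterion behind the reported mAP figure, the following is a minimal sketch of how a predicted 3D bounding box could be counted as a match at an intersection-over-union threshold of 0.5. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, which is a simplification chosen for illustration; the function names are hypothetical and this is not taken from the CARTO evaluation code, which may use oriented boxes.

```python
from typing import Sequence

# Assumed box layout (illustrative): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    return (
        max(0.0, box[3] - box[0])
        * max(0.0, box[4] - box[1])
        * max(0.0, box[5] - box[2])
    )


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    overlap = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])          # start of the overlap on this axis
        hi = min(a[axis + 3], b[axis + 3])  # end of the overlap on this axis
        overlap *= max(0.0, hi - lo)        # zero if the boxes do not overlap
    union = box_volume(a) + box_volume(b) - overlap
    return overlap / union if union > 0.0 else 0.0


def is_match_iou50(pred: Box, gt: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as correct at IoU50 if its IoU reaches the threshold."""
    return iou_3d_axis_aligned(pred, gt) >= threshold
```

Under this criterion, a prediction that covers exactly half of a ground-truth box and nothing else has an IoU of 0.5 and is accepted, while two equally sized boxes shifted by half their width along one axis have an IoU of 1/3 and are rejected.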