We introduce ART, the Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes a unified representation for each part, including its 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision and evaluated across multiple benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.
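To make the part-slot formulation concrete, the following is a minimal sketch of how learnable part slots might cross-attend to image tokens and decode per-part outputs, in PyTorch. Every module name, head, dimension, and output size here is an illustrative assumption for exposition; none of it is taken from the authors' implementation.

```python
# Minimal sketch of a part-slot decoder in the spirit of the abstract.
# All names, dimensions, and output heads are hypothetical assumptions.
import torch
import torch.nn as nn

class PartSlotDecoder(nn.Module):
    def __init__(self, num_slots=16, dim=512, num_layers=6, num_heads=8):
        super().__init__()
        # Learnable part slots: one query per candidate rigid part.
        self.slots = nn.Parameter(torch.randn(num_slots, dim))
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Slots cross-attend to tokens from the sparse multi-state views.
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Per-slot heads for a unified part representation (hypothetical
        # sizes): geometry latent, texture latent, and explicit articulation
        # parameters (e.g., joint axis, origin, type logits, motion range).
        self.geometry_head = nn.Linear(dim, 256)
        self.texture_head = nn.Linear(dim, 256)
        self.articulation_head = nn.Linear(dim, 3 + 3 + 2 + 2)

    def forward(self, image_tokens):
        # image_tokens: (B, N, dim) tokens from an image encoder (assumed)
        # run over the sparse, multi-state RGB inputs.
        B = image_tokens.shape[0]
        slots = self.slots.unsqueeze(0).expand(B, -1, -1)
        slots = self.decoder(tgt=slots, memory=image_tokens)
        return {
            "geometry": self.geometry_head(slots),
            "texture": self.texture_head(slots),
            "articulation": self.articulation_head(slots),
        }

# Usage with dummy tokens standing in for 4 encoded views of 196 patches.
tokens = torch.randn(2, 4 * 196, 512)
parts = PartSlotDecoder()(tokens)
print(parts["articulation"].shape)  # (2, 16, 10): per-slot joint parameters
```

Decoding all parts jointly from a fixed set of slots is what makes the prediction feed-forward and category-agnostic: no cross-state correspondence search is needed, and each slot's explicit articulation output can be exported directly to a simulator.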