We introduce SAOR, a novel approach for estimating the 3D shape, texture, and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons, SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. To prevent ill-posed solutions, we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is helped by a new silhouette-based sampling mechanism to enhance viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time, given a single-view image, it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work.
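To make the skeleton-free part-based articulation concrete, below is a minimal PyTorch sketch of the general idea: each mesh vertex receives a learned soft assignment over parts, and per-image rigid transforms predicted for each part are blended per vertex (linear blend skinning without a predefined skeleton). The function and argument names here are hypothetical illustrations, not the authors' implementation.

```python
import torch

def articulate(verts, part_logits, rotations, translations):
    """Skeleton-free part-based articulation (illustrative sketch).

    verts:        (V, 3) rest-pose mesh vertices
    part_logits:  (V, P) learned per-vertex part-assignment scores
    rotations:    (P, 3, 3) per-part rotation matrices predicted per image
    translations: (P, 3)    per-part translations predicted per image
    """
    # Soft part memberships: each vertex is a convex combination of parts,
    # so no discrete skeleton or predefined part segmentation is needed.
    weights = torch.softmax(part_logits, dim=-1)                              # (V, P)

    # Apply every part's rigid transform to every vertex.
    per_part = torch.einsum('pij,vj->vpi', rotations, verts) + translations   # (V, P, 3)

    # Blend the per-part positions with the soft weights (linear blend skinning).
    return torch.einsum('vp,vpi->vi', weights, per_part)                      # (V, 3)

# Toy usage with random inputs (V = 1000 vertices, P = 8 parts).
V, P = 1000, 8
deformed = articulate(torch.randn(V, 3), torch.randn(V, P),
                      torch.eye(3).expand(P, 3, 3), torch.zeros(P, 3))
```

Because the part assignments are soft and learned end-to-end, the same mechanism also supports the cross-instance consistency idea above: shape deformation and articulation can be swapped or shared across instances of a category without a hand-built 3D skeleton.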