We consider the problem of learning a function that can estimate the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal like a horse, given a single test image. We present a new method, dubbed MagicPony, that learns this function purely from in-the-wild single-view images of the object category, with minimal assumptions about the topology of deformation. At its core is an implicit-explicit representation of articulated shape and appearance, combining the strengths of neural fields and meshes. In order to help the model understand an object's shape and pose, we distil the knowledge captured by an off-the-shelf self-supervised vision transformer and fuse it into the 3D model. To overcome common local optima in viewpoint estimation, we further introduce a new viewpoint sampling scheme that comes at no added training cost. Compared to prior works, we show significant quantitative and qualitative improvements on this challenging task. The model also demonstrates excellent generalisation in reconstructing abstract drawings and artefacts, despite the fact that it is only trained on real images.
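To make the viewpoint sampling idea concrete, below is a minimal sketch of a multi-hypothesis viewpoint scheme of the kind described above: a small head proposes a few candidate viewpoints with scores, a single hypothesis is sampled per training iteration (so no extra renders are required), and the scores are trained to regress the observed reconstruction loss so that better viewpoints gradually dominate the sampling distribution. This is not the authors' implementation; the network architecture, the choice of K = 4 hypotheses, the axis-angle parameterisation, and the `render_and_loss` callback are all illustrative assumptions.

```python
# Illustrative sketch (not the MagicPony implementation) of multi-hypothesis
# viewpoint sampling with no added rendering cost per iteration.
import torch
import torch.nn as nn

K = 4  # assumed number of viewpoint hypotheses (e.g. rotations about the up-axis)


class ViewpointHead(nn.Module):
    """Predicts K candidate viewpoint offsets and a score for each."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Per hypothesis: a 3-D rotation offset (axis-angle) plus one score.
        self.mlp = nn.Linear(feat_dim, K * 4)

    def forward(self, feat):
        out = self.mlp(feat).view(-1, K, 4)
        rot_offsets, scores = out[..., :3], out[..., 3]
        return rot_offsets, scores


def training_step(feat, render_and_loss, head, optimizer):
    """One illustrative step: sample one hypothesis, reconstruct, update scores."""
    rot_offsets, scores = head(feat)
    probs = torch.softmax(-scores, dim=-1)            # low predicted loss -> high probability
    k = torch.multinomial(probs[0], 1).item()         # sample ONE hypothesis per iteration
    recon_loss = render_and_loss(rot_offsets[:, k])   # reconstruction loss for that viewpoint
    # Train the chosen hypothesis' score to regress the observed loss,
    # so future sampling favours viewpoints that reconstruct well.
    score_loss = (scores[:, k] - recon_loss.detach()) ** 2
    total = recon_loss + score_loss.mean()
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```

Sampling a single hypothesis per iteration is what keeps the scheme free of added training cost: only one render and one backward pass are performed, while the learned scores let the model escape the local optima (e.g. front/back flips) that plague single-hypothesis viewpoint prediction.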