We present Farm3D, a method to learn category-specific 3D reconstructors for articulated objects, using only "free" virtual supervision from a pre-trained 2D diffusion-based image generator. Recent approaches can learn, given a collection of single-view images of an object category, a monocular network that predicts the 3D shape, albedo, illumination, and viewpoint of any instance of that category. We propose a framework that uses an image generator such as Stable Diffusion to produce virtual training data for learning such a reconstruction network from scratch. Furthermore, we include the diffusion model as a score to further improve learning. The idea is to randomise some aspects of the reconstruction, such as viewpoint and illumination, render synthetic views of the reconstructed 3D object, and have the 2D network assess the quality of the resulting images, providing feedback to the reconstructor. Unlike distillation-based work, which spends hours producing a single 3D asset per textual prompt, our approach produces a monocular reconstruction network that can output a controllable 3D asset from a given image, real or generated, in only seconds. Our network can be used for analysis, including monocular reconstruction, or for synthesis, generating articulated assets for real-time applications such as video games.
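The feedback loop described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: `Reconstructor`, `render`, and `TinyDenoiser` are hypothetical stand-ins for the monocular network, a differentiable renderer, and a frozen pre-trained denoiser such as Stable Diffusion's UNet. The diffusion critic is applied via the standard score-distillation gradient, i.e. the difference between predicted and injected noise.

```python
# Minimal sketch (not the paper's code) of diffusion-guided reconstruction:
# predict 3D parameters from an image, render a synthetic view under a random
# viewpoint and illumination, and let a frozen diffusion model score it.
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    """Hypothetical monocular network: image -> latent '3D' parameters."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
    def forward(self, img):
        return self.net(img)

def render(params, viewpoint, light):
    """Hypothetical differentiable renderer: params + pose/light -> image."""
    b = params.shape[0]
    shade = params.mean(dim=1) + viewpoint + light      # (b,)
    return shade.view(b, 1, 1, 1).expand(b, 3, 32, 32)

class TinyDenoiser(nn.Module):
    """Stand-in for a frozen pre-trained diffusion UNet (e.g. Stable Diffusion)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, noisy, t):                        # t unused in this stub
        return self.conv(noisy)                         # predicted noise

reconstructor = Reconstructor()
denoiser = TinyDenoiser().requires_grad_(False)         # frozen 2D critic
opt = torch.optim.Adam(reconstructor.parameters(), lr=1e-4)

for step in range(3):
    input_img = torch.rand(4, 3, 32, 32)                # real or generated image
    params = reconstructor(input_img)                   # predict 3D parameters
    # Randomise viewpoint and illumination, then render a synthetic view.
    viewpoint = torch.rand(4) * 2 - 1
    light = torch.rand(4) * 0.5
    synth = render(params, viewpoint, light)
    # Score-distillation-style feedback: noise the rendering, have the frozen
    # denoiser predict the noise, and push the rendering toward images the
    # diffusion model finds likely (gradient = eps_pred - eps).
    t = torch.randint(1, 1000, (1,))
    alpha = 1.0 - t.float() / 1000.0
    eps = torch.randn_like(synth)
    noisy = alpha.sqrt() * synth + (1 - alpha).sqrt() * eps
    eps_pred = denoiser(noisy, t)
    grad = (eps_pred - eps).detach()
    loss = (synth * grad).sum()                         # d loss / d synth == grad
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Everything specific here (network shapes, the toy renderer, the noise schedule) is an assumption for illustration; the key structure is that the diffusion model stays frozen and only supplies a gradient signal to the reconstructor through the rendered view.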