Recent work has demonstrated that generative models of 3D content can be trained from 2D image collections, but only on small datasets limited to a single object class, such as human faces, animal faces, or cars. These models struggle on larger, more complex datasets. To model diverse and unconstrained image collections such as ImageNet, we present VQ3D, which introduces a NeRF-based decoder into a two-stage vector-quantized autoencoder. Stage 1 reconstructs an input image and supports novel views of it by moving the camera; Stage 2 generates new 3D scenes. VQ3D is capable of generating and reconstructing 3D-aware images from the 1000-class ImageNet dataset of 1.2 million training images. We achieve an ImageNet generation FID score of 16.8, compared to 69.8 for the next best baseline method.
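To make the two-stage layout concrete, here is a minimal PyTorch sketch of a vector-quantized autoencoder whose decoder is conditioned on a camera pose. All names here (`VectorQuantizer`, `Stage1`, `pose_proj`) are hypothetical illustrations, not VQ3D's actual modules; in particular, the plain convolutional decoder stands in for the paper's NeRF-based decoder (its volume rendering is elided for brevity), and the Stage 2 prior is only noted in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                # z: (B, dim, H, W)
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)      # (B*H*W, dim)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)
        # codebook + commitment losses, as in standard VQ-VAE training
        commit = F.mse_loss(z, q.detach()) + F.mse_loss(q, z.detach())
        q = z + (q - z).detach()                         # straight-through estimator
        return q, idx.view(B, H * W), commit

class Stage1(nn.Module):
    """Encoder -> VQ -> pose-conditioned decoder. The conv decoder is a
    stand-in: VQ3D's actual Stage 1 decoder is NeRF-based, rendering the
    quantized latents volumetrically from the given camera pose."""
    def __init__(self, dim=256):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, dim, 4, 2, 1),
        )
        self.vq = VectorQuantizer(dim=dim)
        self.pose_proj = nn.Linear(12, dim)              # flattened 3x4 camera matrix
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(dim, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1),
        )

    def forward(self, img, pose):
        q, idx, commit = self.vq(self.enc(img))
        q = q + self.pose_proj(pose)[:, :, None, None]   # inject camera pose
        return self.dec(q), idx, commit

img, pose = torch.randn(2, 3, 64, 64), torch.randn(2, 12)
recon, idx, commit = Stage1()(img, pose)
loss = F.mse_loss(recon, img) + commit
# Stage 2 would fit an autoregressive prior (e.g. a transformer) over `idx`,
# so that sampling new index sequences yields new 3D scenes.
```

The straight-through estimator lets reconstruction gradients flow through the non-differentiable codebook lookup, which is why the same frozen codebook can later serve as the vocabulary for the Stage 2 generative prior.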