Deep generative models allow for photorealistic image synthesis at high resolutions. But for many applications, this is not enough: content creation also needs to be controllable. While several recent works investigate how to disentangle underlying factors of variation in the data, most of them operate in 2D and hence ignore that our world is three-dimensional. Further, only a few works consider the compositional nature of scenes. Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis. Representing scenes as compositional generative neural feature fields allows us to disentangle one or multiple objects from the background, as well as individual objects' shapes and appearances, while learning from unstructured and unposed image collections without any additional supervision. Combining this scene representation with a neural rendering pipeline yields a fast and realistic image synthesis model. As evidenced by our experiments, our model is able to disentangle individual objects and allows for translating and rotating them in the scene as well as changing the camera pose.
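To make the idea of a compositional generative neural feature field concrete, the following is a minimal, illustrative sketch (not the authors' implementation): each object is modeled by a field that maps a 3D point to a density and a feature vector, per-object fields are composited by summing densities and density-weighting features, and the composed field is volume-rendered along camera rays into a low-resolution feature image that a 2D neural renderer would then upsample. The toy field definition, shapes, helper names (`toy_feature_field`, `compose_fields`, `render_ray`), and the simple ray marcher are all assumptions made for illustration.

```python
# Hedged sketch of density-weighted composition of per-object feature fields
# followed by toy volume rendering along a single ray. Not the paper's code.
import numpy as np

def toy_feature_field(x, center, scale, feat_dim=8, seed=0):
    """Stand-in for one object's generative feature field.

    Returns a density sigma(x) and a feature vector f(x) per 3D point.
    A real model would condition both on latent shape/appearance codes.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((3, feat_dim))          # fixed random "appearance"
    d2 = np.sum(((x - center) / scale) ** 2, axis=-1)
    sigma = np.exp(-d2)                             # soft blob-shaped density
    feat = np.tanh(x @ w)                           # point-wise features
    return sigma, feat

def compose_fields(sigmas, feats, eps=1e-8):
    """Composite fields: sum densities, density-weight the features."""
    sigmas = np.stack(sigmas, axis=0)               # (n_objects, n_points)
    feats = np.stack(feats, axis=0)                 # (n_objects, n_points, d)
    sigma = sigmas.sum(axis=0)
    f = (sigmas[..., None] * feats).sum(axis=0) / (sigma[..., None] + eps)
    return sigma, f

def render_ray(origin, direction, fields, n_samples=32, near=0.5, far=4.0):
    """Toy volume rendering of the composed field along one camera ray."""
    t = np.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction           # (n_samples, 3)
    sigmas, feats = zip(*(field(pts) for field in fields))
    sigma, f = compose_fields(sigmas, feats)
    delta = t[1] - t[0]
    alpha = 1.0 - np.exp(-sigma * delta)            # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans
    return (weights[:, None] * f).sum(axis=0)       # rendered feature vector

if __name__ == "__main__":
    # Two "objects": translating one only changes its own center, which is why
    # object-level edits stay disentangled from the rest of the scene.
    fields = [
        lambda p: toy_feature_field(p, center=np.array([0.0, 0.0, 2.0]), scale=0.5, seed=1),
        lambda p: toy_feature_field(p, center=np.array([0.8, 0.0, 2.5]), scale=0.7, seed=2),
    ]
    feature_pixel = render_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]), fields)
    print(feature_pixel.shape)  # (8,) feature; a 2D neural renderer would map such features to RGB
```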