We propose an unsupervised, mid-level representation for a generative model of scenes. The representation is mid-level in that it is neither per-pixel nor per-image; rather, scenes are modeled as a collection of spatial, depth-ordered "blobs" of features. Blobs are differentiably placed onto a feature grid that is decoded into an image by a generative adversarial network. Due to the spatial uniformity of blobs and the locality inherent to convolution, our network learns to associate different blobs with different entities in a scene and to arrange these blobs to capture scene layout. We demonstrate this emergent behavior by showing that, despite training without any supervision, our method enables applications such as easy manipulation of objects within a scene (e.g., moving, removing, and restyling furniture), creation of feasible scenes given constraints (e.g., plausible rooms with drawers at a particular location), and parsing of real-world images into constituent parts. On a challenging multi-category dataset of indoor scenes, BlobGAN outperforms StyleGAN2 in image quality as measured by FID. See our project page for video results and interactive demo: http://www.dave.ml/blobgan
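The core mechanism described above — differentiably splatting depth-ordered feature blobs onto a spatial grid — can be illustrated with a toy sketch. This is a hypothetical simplification for intuition only, not the paper's actual parameterization: it models each blob as an isotropic Gaussian opacity mask over normalized grid coordinates and alpha-composites blob feature vectors back-to-front by depth.

```python
import numpy as np

def splat_blobs(centers, scales, depths, feats, H=16, W=16):
    """Toy, differentiable blob splatting (hypothetical sketch).

    centers: (N, 2) blob centers in normalized [0, 1] coordinates
    scales:  (N,)   blob radii controlling the Gaussian falloff
    depths:  (N,)   smaller = closer to the camera
    feats:   (N, C) per-blob feature vectors
    Returns an (H, W, C) feature grid ready for a convolutional decoder.
    """
    ys, xs = np.mgrid[0:H, 0:W] / max(H, W)  # normalized pixel coords
    grid = np.zeros((H, W, feats.shape[1]))
    # Back-to-front compositing: paint the farthest blob first so that
    # nearer blobs occlude it, giving a soft depth ordering.
    for i in np.argsort(-depths):
        cy, cx = centers[i]
        d2 = (ys - cy) ** 2 + (xs - cx) ** 2
        alpha = np.exp(-d2 / (2 * scales[i] ** 2))  # smooth opacity mask
        grid = (1 - alpha[..., None]) * grid + alpha[..., None] * feats[i]
    return grid

# Two blobs: a near one at the top-left, a far one at the bottom-right.
centers = np.array([[0.25, 0.25], [0.75, 0.75]])
scales = np.array([0.1, 0.1])
depths = np.array([0.2, 0.8])
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
grid = splat_blobs(centers, scales, depths, feats)
```

Because the opacity masks are smooth functions of the blob parameters, gradients flow from the decoded image back to each blob's position, size, and features, which is what lets the layout emerge without supervision.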