We present BlockGAN, an image generative model that learns object-aware 3D scene representations directly from unlabelled 2D images. Current work on scene representation learning either ignores scene background or treats the whole scene as one object. Meanwhile, work that considers scene compositionality treats scene objects only as image patches or 2D layers with alpha maps. Inspired by the computer graphics pipeline, we design BlockGAN to learn to first generate 3D features of background and foreground objects, then combine them into 3D features for the whole scene, and finally render them into realistic images. This allows BlockGAN to reason over occlusion and interaction between objects' appearance, such as shadow and lighting, and provides control over each object's 3D pose and identity, while maintaining image realism. BlockGAN is trained end-to-end, using only unlabelled single images, without the need for 3D geometry, pose labels, object masks, or multiple views of the same scene. Our experiments show that using explicit 3D features to represent objects allows BlockGAN to learn disentangled representations both in terms of objects (foreground and background) and their properties (pose and identity).
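To make the compositional pipeline concrete, below is a minimal PyTorch sketch of the flow the abstract describes: latent codes are mapped to per-object 3D feature volumes, the foreground volume is rigidly transformed by its 3D pose, the volumes are combined into one scene volume, and the result is projected and rendered as a 2D image. All layer sizes, the element-wise-maximum combination operator, and the class and parameter names (`BlockGANSketch`, `z_bg`, `z_fg`, `pose_fg`) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlockGANSketch(nn.Module):
    """Illustrative sketch: per-object 3D features -> pose transform
    -> scene combination -> depth projection -> 2D rendering.
    Sizes and operators are assumptions, not the paper's design."""

    def __init__(self, z_dim=64, ch=16, vol=16):
        super().__init__()
        self.vol = vol
        # Hypothetical per-object generators mapping a latent code to a
        # 3D feature volume of shape (ch, vol, vol, vol).
        self.bg_gen = nn.Linear(z_dim, ch * vol ** 3)
        self.fg_gen = nn.Linear(z_dim, ch * vol ** 3)
        # Hypothetical 2D renderer applied after collapsing depth.
        self.render = nn.Sequential(
            nn.Conv2d(ch * vol, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, z_bg, z_fg, pose_fg):
        n, vol = z_bg.size(0), self.vol
        # Generate separate 3D feature volumes for background and foreground.
        bg = self.bg_gen(z_bg).view(n, -1, vol, vol, vol)
        fg = self.fg_gen(z_fg).view(n, -1, vol, vol, vol)
        # Rigidly transform the foreground volume by its 3D pose, given
        # here as an (n, 3, 4) affine matrix (rotation | translation).
        grid = F.affine_grid(pose_fg, fg.shape, align_corners=False)
        fg = F.grid_sample(fg, grid, align_corners=False)
        # Combine object volumes into one scene volume; element-wise max
        # is one simple choice (the paper's exact operator may differ).
        scene = torch.maximum(bg, fg)
        # Fold the depth axis into channels and render to an RGB image.
        flat = scene.view(n, -1, vol, vol)
        return self.render(flat)


# Usage: sample latents, place the foreground at an identity pose.
model = BlockGANSketch()
z_bg, z_fg = torch.randn(2, 64), torch.randn(2, 64)
pose = torch.eye(3, 4).expand(2, -1, -1)  # identity rotation, no translation
img = model(z_bg, z_fg, pose)             # (2, 3, 16, 16)
```

Because each object is a separate 3D volume, changing `pose_fg` moves only the foreground while the background and the downstream rendering stay fixed, which is the disentangled per-object control the abstract claims.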