Despite recent impressive results on single-object and single-domain image generation, the generation of complex scenes with multiple objects remains challenging. In this paper, we start from the idea that a model must be able to understand individual objects and the relationships between them in order to generate complex scenes well. Our layout-to-image generation method, which we call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of the spatial relationships between objects in the scene, yielding improved layout fidelity in our model. We also propose changes to the conditioning mechanism of the generator that enhance its object instance-awareness. Apart from improving image quality, our contributions mitigate two failure modes in previous approaches: (1) spurious objects being generated without corresponding bounding boxes in the layout, and (2) overlapping bounding boxes in the layout leading to merged objects in images. Extensive quantitative evaluation and ablation studies demonstrate the impact of our contributions, with our model outperforming previous state-of-the-art approaches on both the COCO-Stuff and Visual Genome datasets. Finally, we address an important limitation of the evaluation metrics used in previous works by introducing SceneFID -- an object-centric adaptation of the popular Fr{\'e}chet Inception Distance metric that is better suited for multi-object images.
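To make the idea behind SceneFID concrete, below is a minimal sketch, not the paper's evaluation code: FID is computed over per-object crops extracted with the layout's bounding boxes rather than over full images. The box format, the cropping details, and the `extract_features` helper (standing in for a standard Inception-v3 feature extractor) are all assumptions for illustration.

```python
import numpy as np
from scipy import linalg

def crop_objects(images, layouts):
    """Collect one crop per annotated object, using its bounding box.

    `images` is a list of H x W x 3 arrays; `layouts` is a list of
    per-image box lists in (x0, y0, x1, y1) pixel coordinates (an
    assumed input format, not necessarily the paper's).
    """
    crops = []
    for image, boxes in zip(images, layouts):
        for (x0, y0, x1, y1) in boxes:
            crops.append(image[y0:y1, x0:x1])
    return crops

def frechet_distance(feats_a, feats_b):
    """Standard Frechet distance between Gaussian fits of two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

def scene_fid(real_images, generated_images, layouts, extract_features):
    """SceneFID: FID over object crops instead of whole images.

    `extract_features` maps a list of crops to an (N, D) feature array,
    e.g. Inception-v3 pooled features after resizing each crop.
    """
    real_crops = crop_objects(real_images, layouts)
    fake_crops = crop_objects(generated_images, layouts)
    return frechet_distance(extract_features(real_crops),
                            extract_features(fake_crops))
```

Because every crop contributes an independent sample, merged or spurious objects that whole-image FID tends to average away show up directly in the per-object feature statistics.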