Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image. Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation, respectively. In this work, we show how employing multi-head attention to encode the graph information, as well as using a transformer-based model in the latent space for image generation, can improve the quality of the sampled data, without the need to employ adversarial models, with the consequent advantage in terms of training stability. Specifically, the proposed approach is entirely based on transformer architectures, both for encoding scene graphs into intermediate object layouts and for decoding these layouts into images, passing through a lower-dimensional space learned by a vector-quantized variational autoencoder. Our approach shows improved image quality with respect to state-of-the-art methods, as well as a higher degree of diversity among multiple generations from the same scene graph. We evaluate our approach on three public datasets: Visual Genome, COCO, and CLEVR. We achieve an Inception Score of 13.7 and 12.8, and an FID of 52.3 and 60.3, on COCO and Visual Genome, respectively. We perform ablation studies on our contributions to assess the impact of each component. Code is available at https://github.com/perceivelab/trf-sg2im.
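To make the described pipeline concrete, the following is a minimal sketch (not the authors' released code) of a fully transformer-based scene-graph-to-image setup of the kind summarized above: a multi-head-attention encoder maps graph triplets to object layouts, and a latent transformer predicts discrete VQ-VAE codes conditioned on those layouts. All class names, dimensions, and the triplet tokenization are illustrative assumptions; the pretrained VQ-VAE decoder that maps codes back to pixels is omitted.

```python
import torch
import torch.nn as nn

class SceneGraphEncoder(nn.Module):
    """Encodes (subject, predicate, object) tokens with multi-head self-attention
    and predicts a coarse layout (one bounding box per token)."""
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.box_head = nn.Linear(d_model, 4)  # (x, y, w, h) in [0, 1]

    def forward(self, triplet_tokens):
        # triplet_tokens: (batch, seq_len) integer ids for objects and predicates
        h = self.encoder(self.embed(triplet_tokens))
        return h, self.box_head(h).sigmoid()

class LatentTransformer(nn.Module):
    """Autoregressively predicts discrete VQ-VAE code indices, attending to the
    layout features produced by the scene-graph encoder."""
    def __init__(self, codebook_size=1024, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        self.code_embed = nn.Embedding(codebook_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.to_logits = nn.Linear(d_model, codebook_size)

    def forward(self, code_ids, layout_features):
        tgt = self.code_embed(code_ids)
        # Causal mask so each latent code only attends to previous codes.
        causal = torch.triu(torch.ones(tgt.size(1), tgt.size(1), dtype=torch.bool), 1)
        h = self.decoder(tgt, layout_features, tgt_mask=causal)
        return self.to_logits(h)  # logits over the VQ-VAE codebook

# Usage: encode a toy graph of 15 tokens, then score 64 latent codes; a real
# system would sample codes step by step and decode them with the VQ-VAE.
enc, lat = SceneGraphEncoder(vocab_size=200), LatentTransformer()
feats, boxes = enc(torch.randint(0, 200, (2, 15)))
logits = lat(torch.randint(0, 1024, (2, 64)), feats)
```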