We present a method that achieves state-of-the-art results on challenging (few-shot) layout-to-image generation tasks by accurately modeling the textures, structures, and relationships contained in a complex scene. After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies. In contrast to existing CNN-based and Transformer-based generation models, which perform entangled modeling at the pixel&patch level and the object&patch level respectively, the proposed focal attention predicts the current patch token by attending only to its highly related tokens, as specified by the spatial layout, thereby achieving disambiguation during training. Furthermore, TwFA greatly improves data efficiency during training, and we therefore propose the first few-shot complex scene generation strategy built on a well-trained TwFA. Comprehensive experiments demonstrate the superiority of our method, which significantly improves both quantitative metrics and qualitative visual realism over state-of-the-art CNN-based and Transformer-based methods. Code is available at https://github.com/JohnDreamer/TwFA.
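The core idea of focal attention, restricting each patch token to attend only to layout-related tokens instead of the full sequence, can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the authors' implementation: the token ordering (object tokens first, then patch tokens), the helper names `focal_attention_mask` and `masked_attention`, and the exact masking rules are all hypothetical simplifications chosen for clarity.

```python
import numpy as np

def focal_attention_mask(patch_obj, n_obj):
    """Boolean attention mask for focal attention (illustrative sketch).

    patch_obj[i] gives the object id of patch i, derived from the layout.
    Assumed token order: n_obj object tokens first, then patch tokens.
    A patch token may attend to (a) the object token of its own object
    (object-to-patch), (b) earlier patch tokens of the same object
    (patch-to-patch, causal), and (c) itself -- all other tokens are masked.
    """
    n_patch = len(patch_obj)
    n = n_obj + n_patch
    mask = np.zeros((n, n), dtype=bool)
    # Object tokens attend causally to each other (object-to-object).
    for i in range(n_obj):
        mask[i, : i + 1] = True
    for i, oi in enumerate(patch_obj):
        q = n_obj + i
        mask[q, oi] = True           # its own object token
        mask[q, q] = True            # itself
        for j, oj in enumerate(patch_obj[:i]):
            if oj == oi:             # earlier patch of the same object
                mask[q, n_obj + j] = True
    return mask

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention; disallowed positions get a large
    negative score so their softmax weight is effectively zero."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

With `patch_obj = [0, 0, 1]` and two object tokens, patch 2 (belonging to object 1) is blocked from attending to the patches of object 0, which is the disambiguation the abstract describes: unrelated regions cannot leak information into the current prediction.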