Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
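The two-stage idea described above — (i) compress an image into a discrete vocabulary of constituents, then (ii) model the resulting token sequence autoregressively — can be sketched at toy scale. All names, sizes, and the nearest-neighbour lookup below are illustrative assumptions, not the released code's API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage (i), toy version: a learned codebook maps patch features to discrete indices.
# 16 code vectors of dimension 4; a real codebook is far larger and CNN-learned.
codebook = rng.normal(size=(16, 4))
patches = rng.normal(size=(8 * 8, 4))  # an 8x8 grid of patch features

def quantize(z, book):
    """Replace each feature vector by the index of its nearest codebook entry."""
    dists = ((z[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Stage (ii) would then treat `tokens` as a sequence for an autoregressive
# transformer, predicting each index from the preceding ones.
tokens = quantize(patches, codebook)
print(tokens.shape)  # one discrete token per patch: (64,)
```

Because the transformer only ever sees the short index sequence (here 64 tokens) rather than raw pixels, its quadratic attention cost stays manageable even for high-resolution outputs.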