We present a new perspective on image synthesis by framing the task as a visual token generation problem. Unlike existing paradigms that synthesize a full image directly from a single input (e.g., a latent code), the new formulation allows flexible local manipulation of different image regions, which makes it possible to learn content-aware and fine-grained style control for image synthesis. Specifically, it takes a sequence of latent tokens as input and predicts the visual tokens used to synthesize an image. Under this perspective, we propose a token-based generator, TokenGAN. TokenGAN takes two semantically different kinds of tokens as input: learned constant content tokens and style tokens drawn from the latent space. Given a sequence of style tokens, TokenGAN controls image synthesis by assigning the styles to the content tokens through the attention mechanism of a Transformer. We conduct extensive experiments and show that TokenGAN achieves state-of-the-art results on several widely used image synthesis benchmarks, including FFHQ and LSUN CHURCH at different resolutions. In particular, the generator can synthesize high-fidelity images at 1024x1024 resolution while dispensing with convolutions entirely.
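As an illustration of how styles might be assigned to content tokens by attention, the following is a minimal sketch and not the TokenGAN architecture itself. It assumes a single-head cross-attention in which content tokens act as queries and style tokens act as keys and values, followed by a residual update. The function name `assign_styles`, the projection matrices, the token counts, the residual connection, and the random stand-ins for learned and sampled tokens are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def assign_styles(content, style, w_q, w_k, w_v):
    """Hypothetical cross-attention step: content tokens query style tokens."""
    q = content @ w_q                      # (n_content, d)
    k = style @ w_k                        # (n_style, d)
    v = style @ w_v                        # (n_style, d)
    # Each content token gets a distribution over the style tokens.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_content, n_style)
    # Assumed residual update: each content token receives its own mix of styles.
    return content + attn @ v, attn

rng = np.random.default_rng(0)
d, n_content, n_style = 16, 64, 8
# Random stand-ins: in a trained model the content tokens would be learned
# constants and the style tokens would come from the latent space.
content = rng.standard_normal((n_content, d))
style = rng.standard_normal((n_style, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, attn = assign_styles(content, style, w_q, w_k, w_v)
```

Because every content token holds its own attention weights over the style tokens, different image regions can draw on different styles, which is the kind of local, content-aware control described above.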