Generative transformers have shown their superiority in synthesizing high-fidelity and high-resolution images, offering advantages such as good diversity and training stability. However, they suffer from slow generation, since they must produce a long token sequence autoregressively. To accelerate generative transformers while preserving generation quality, we propose Lformer, a semi-autoregressive text-to-image generation model. Lformer first encodes an image into $h{\times}h$ discrete tokens, then divides these tokens into $h$ mirrored L-shaped blocks from the top left to the bottom right, and decodes the tokens within a block in parallel at each step. Like autoregressive models, Lformer predicts the area adjacent to the previous context, so it remains stable while achieving acceleration. By leveraging the 2D structure of image tokens, Lformer achieves faster generation than existing transformer-based methods while maintaining good generation quality. Moreover, the pretrained Lformer can edit images without finetuning: we can roll back to early steps for regeneration, or edit an image with a bounding box and a text prompt.
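Below is a minimal sketch of the L-shaped partition described above, assuming 0-indexed grid coordinates: block $k$ covers row $k$ up to column $k$ plus column $k$ above row $k$, so the $h$ blocks tile the full grid and block sizes grow as $2k+1$. The function name `l_shaped_blocks` and the token ordering inside each block are illustrative assumptions, not the paper's implementation.

```python
def l_shaped_blocks(h):
    """Partition an h x h token grid into h mirrored L-shaped blocks.

    Block k contains row k at columns 0..k plus column k at rows 0..k-1,
    i.e. 2k + 1 tokens, and the blocks tile the grid from the top left
    to the bottom right. Each block can then be decoded in parallel in
    one step, giving h decoding steps instead of h * h.
    """
    blocks = []
    for k in range(h):
        block = [(k, j) for j in range(k + 1)]   # horizontal arm of the L
        block += [(i, k) for i in range(k)]      # vertical arm of the L
        blocks.append(block)
    return blocks

# Example: for h = 4 the block sizes are 1, 3, 5, 7 (summing to 16 tokens),
# so generation finishes in 4 parallel steps rather than 16 sequential ones.
for k, block in enumerate(l_shaped_blocks(4)):
    print(f"step {k}: {len(block)} tokens -> {block}")
```

Under this sketch, each block borders only previously decoded tokens, which matches the abstract's claim that Lformer always predicts the area adjacent to its existing context.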