The landscape of image generation has been forever changed by open-vocabulary diffusion models. At their core, however, these models rely on transformers, which makes generation slow. More efficient transformer implementations have emerged to increase throughput, but they still evaluate the entire model. In this paper, we instead speed up diffusion models by merging redundant tokens, exploiting the natural redundancy present in generated images. After making several diffusion-specific improvements to Token Merging (ToMe), our ToMe for Stable Diffusion can reduce the number of tokens in an existing Stable Diffusion model by up to 60% while still producing high-quality images without any extra training. In the process, we speed up image generation by up to 2x and reduce memory consumption by up to 5.6x. Furthermore, this speed-up stacks with efficient implementations such as xFormers, minimally impacting quality while being up to 5.4x faster for large images. Code is available at https://github.com/dbolya/tomesd.
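As a concrete illustration of how a patch like this is meant to be used, the sketch below applies token merging to an off-the-shelf text-to-image pipeline. It is a minimal example, assuming the tomesd package from the repository linked above together with Hugging Face's diffusers library; the model identifier and the 0.5 merging ratio are illustrative choices, and the exact apply_patch signature should be confirmed against the repository's documentation.

    import torch
    import tomesd
    from diffusers import StableDiffusionPipeline

    # Load a standard Stable Diffusion pipeline (the model id is illustrative).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Patch the pipeline so that up to 50% of tokens are merged inside the
    # U-Net's transformer blocks; no retraining is required, and the patch
    # can be reverted with tomesd.remove_patch(pipe).
    tomesd.apply_patch(pipe, ratio=0.5)

    # Generation now runs with fewer tokens per attention layer.
    image = pipe("a photo of an astronaut riding a horse on mars").images[0]
    image.save("astronaut.png")

Because the patch only changes how tokens flow through the existing attention layers, it composes with other efficiency measures (such as xFormers) rather than replacing them.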