We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to its use of discrete tokens and fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to its use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, and cardinality. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
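The parallel decoding that makes Muse faster than autoregressive samplers can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: `toy_model` stands in for the real Transformer by emitting random logits, and the cosine masking schedule, codebook size, and step count are assumptions chosen for brevity. Starting from a fully masked token grid, each step predicts all masked positions at once and commits only the most confident predictions, so the whole image is decoded in a handful of steps rather than one token at a time.

```python
import numpy as np

MASK = -1   # sentinel value for a still-masked token (illustrative choice)
VOCAB = 8   # toy codebook size; real image tokenizers use thousands of codes

def toy_model(tokens, rng):
    """Stand-in for the masked Transformer: random logits per position."""
    return rng.normal(size=(tokens.size, VOCAB))

def parallel_decode(num_tokens=16, steps=4, seed=0):
    """Iteratively unmask tokens, most confident first (MaskGIT-style)."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK)
    for step in range(steps):
        logits = toy_model(tokens, rng)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        pred = probs.argmax(-1)          # predicted token per position
        conf = probs.max(-1)             # confidence per position
        conf[tokens != MASK] = np.inf    # never re-mask decoded tokens
        # cosine schedule: fraction of the grid left masked after this step
        frac = np.cos((step + 1) / steps * np.pi / 2)
        keep_masked = int(np.floor(frac * num_tokens))
        order = np.argsort(conf)             # ascending confidence
        unmask = order[keep_masked:]         # commit the most confident slots
        newly = unmask[tokens[unmask] == MASK]
        tokens[newly] = pred[newly]
    return tokens
```

With 4 steps over 16 tokens the schedule commits roughly 2, 3, 5, and 6 tokens per step, finishing fully decoded; an autoregressive sampler would need 16 sequential forward passes for the same grid.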