Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. However, autoregressive unified models suffer from slow inference due to sequential decoding, while non-autoregressive unified models generalize poorly because they lack strong pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast, parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance, in both quality and efficiency, compared to significantly larger autoregressive models. This work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
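
The abstract contrasts sequential autoregressive decoding with the parallel decoding enabled by discrete diffusion. The sketch below is not from the paper; it illustrates, under simplifying assumptions, the generic confidence-based mask-and-predict loop used by discrete diffusion decoders, where many masked tokens are filled in per refinement step instead of one token per forward pass. The `model`, `mask_id`, and the linear unmasking schedule are illustrative placeholders, not Muddit's actual components.

```python
import torch

def parallel_mask_decode(model, tokens, mask_id, num_steps=8):
    """Illustrative confidence-based parallel decoding over discrete tokens.

    `model` is assumed to map a token sequence to per-position logits.
    Positions equal to `mask_id` are unknown and are filled in over
    `num_steps` refinement passes rather than one token at a time.
    """
    tokens = tokens.clone()
    for step in range(num_steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = model(tokens)                  # (batch, seq_len, vocab)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)          # per-position confidence and argmax token
        conf = conf.masked_fill(~masked, -1.0)  # only compete over still-masked slots
        # Commit the k most confident predictions this step (simple linear schedule).
        num_masked = int(masked.sum().item())
        k = max(1, num_masked // (num_steps - step))
        top_idx = conf.flatten().topk(k).indices
        flat_tokens = tokens.flatten()
        flat_tokens[top_idx] = pred.flatten()[top_idx]
        tokens = flat_tokens.view_as(tokens)
    return tokens
```

Because every refinement step is a single forward pass that resolves many positions at once, the number of model calls is bounded by `num_steps` rather than by the sequence length, which is the source of the speedup the abstract attributes to parallel discrete diffusion decoding.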