Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
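The contrast between sequential and parallel decoding can be illustrated with a toy sketch. The abstract does not describe Muddit's sampler, so what follows is not the authors' implementation: it is a generic confidence-based unmasking loop for a masked discrete diffusion model, with a cosine schedule chosen purely for illustration. The names `MASK`, `toy_denoiser`, and `parallel_unmask` are hypothetical, and the "denoiser" returns random guesses in place of a trained network.

```python
import math
import random

MASK = -1  # hypothetical id marking a position that is still masked


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a trained network (assumption): one (token, confidence)
    guess per position, produced in a single call for the whole sequence."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_unmask(seq_len, num_steps, vocab_size=16, seed=0):
    """Fill a fully masked sequence in at most `num_steps` denoiser calls."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    calls = 0
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Cosine schedule (illustrative choice): how many tokens should
        # still be masked once this step is finished.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * step / num_steps))
        num_to_fill = max(1, len(masked) - keep_masked)
        preds = toy_denoiser(tokens, vocab_size, rng)
        calls += 1
        # Commit the most confident predictions among masked positions.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens, calls


tokens, calls = parallel_unmask(seq_len=64, num_steps=8)
print(len(tokens), calls)
```

In this sketch the number of denoiser calls is bounded by `num_steps` rather than by the sequence length, whereas a left-to-right autoregressive decoder would need one call per token. That bound is the property the abstract refers to as fast, parallel generation. Real samplers differ in schedule, confidence scoring, and remasking strategy.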

