We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA consists of four main components: 1) an efficient autoencoder with a 32x compression ratio, which substantially reduces the number of tokens required for generation and, by applying the same compression ratio to images, keeps training balanced between understanding and generation tasks; 2) channel-wise rather than token-wise concatenation of visual understanding and generation tokens, which further reduces the number of visual tokens in the unified architecture; 3) a shared-and-decoupled network that enables mutual improvement across tasks while still meeting task-specific modeling requirements; and 4) a mixture-of-experts mechanism in the visual understanding encoder, which substantially improves perceptual capability with only a small increase in parameters. Extensive experiments show that EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results against recent specialized multimodal understanding and generation models (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
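To make the token-efficiency claim concrete, the following is a minimal sketch (not the authors' implementation) contrasting token-wise and channel-wise concatenation of understanding and generation tokens; all tensor shapes and the projection layer are illustrative assumptions.

```python
# Minimal sketch, assuming hypothetical token counts and hidden sizes.
import torch
import torch.nn as nn

batch, n_tokens, dim = 2, 256, 1024              # hypothetical sizes
und_tokens = torch.randn(batch, n_tokens, dim)   # visual understanding tokens
gen_tokens = torch.randn(batch, n_tokens, dim)   # visual generation tokens

# Token-wise concatenation: the sequence length doubles, so attention cost grows.
token_wise = torch.cat([und_tokens, gen_tokens], dim=1)   # shape (2, 512, 1024)

# Channel-wise concatenation: the sequence length is unchanged; the widened
# feature dimension is mapped back to the model width with a (hypothetical)
# linear projection.
proj = nn.Linear(2 * dim, dim)
channel_wise = proj(torch.cat([und_tokens, gen_tokens], dim=-1))  # (2, 256, 1024)

print(token_wise.shape, channel_wise.shape)
```

Under these assumptions, the channel-wise variant keeps the visual sequence at 256 tokens instead of 512, which is the kind of reduction the abstract attributes to this design choice.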