Vision-language large models are moving toward unifying visual understanding and visual generation. However, whether generation can enhance understanding remains under-explored at large data scale. In this work, we analyze the unified architecture with a concise model, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if you generate semantics, not pixels. A common assumption in unified vision-language models is that adding generation naturally strengthens understanding. However, this does not always hold at scale. Beyond 200M pretraining samples, generation helps understanding only when it operates at the semantic level, i.e., when the model learns to autoregress high-level visual representations inside the LLM. Once pixel-level objectives (e.g., diffusion losses) propagate directly into the LLM, understanding performance often degrades. (2) Generation reveals a superior data-scaling trend and higher data utilization. Unified generation-understanding shows a better scaling trend than understanding alone, revealing a more effective way to learn vision-only knowledge directly from the visual modality rather than via captioning into text. (3) Autoregression on input embeddings is effective for capturing visual details. Compared with autoregression on the outputs of a commonly used vision encoder, visual autoregression on input embeddings accumulates less error and is modality-independent, so it can be extended to other modalities. The learned semantic representations capture visual information such as objects, locations, shapes, and colors, and further enable pixel-level image generation.
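To make observations (1) and (3) concrete, below is a minimal PyTorch sketch of semantic-level visual autoregression on input embeddings. It assumes a HuggingFace-style decoder-only LLM that accepts `inputs_embeds` and returns hidden states; all names (`UniHeteroSketch`, `vis_in`, `vis_out`), the dimensions, and the cosine loss are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniHeteroSketch(nn.Module):
    """Illustrative sketch: the LLM autoregresses high-level visual input
    embeddings (semantics); no pixel-level loss touches the LLM."""

    def __init__(self, llm: nn.Module, d_model: int = 4096, d_visual: int = 1024):
        super().__init__()
        self.llm = llm  # decoder-only transformer accepting inputs_embeds
        self.vis_in = nn.Linear(d_visual, d_model)   # visual embedding -> LLM space
        self.vis_out = nn.Linear(d_model, d_visual)  # LLM hidden -> next visual embedding

    def forward(self, text_emb: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, T, d_model), vis_emb: (B, V, d_visual).
        # Simplified interleaving: text tokens followed by visual tokens.
        seq = torch.cat([text_emb, self.vis_in(vis_emb)], dim=1)
        hidden = self.llm(inputs_embeds=seq).last_hidden_state

        # Autoregression on INPUT embeddings: the hidden state at position t
        # predicts the visual input embedding at position t + 1, so the
        # targets are the model's own inputs rather than encoder outputs.
        T = text_emb.size(1)
        pred = self.vis_out(hidden[:, T - 1:-1])  # (B, V, d_visual)
        target = vis_emb.detach()                 # semantic targets, no pixel objective

        # Cosine loss on semantic representations; in training this would be
        # added to the usual text cross-entropy with some weighting.
        return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```

Because the targets are the visual input embeddings themselves, a prediction error at step t does not corrupt the target at step t + 1, which is one plausible reading of the claim that input-embedding autoregression accumulates less error than autoregressing vision-encoder outputs.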