Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately degrades visual capabilities rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find that performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance among small multimodal models.