Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying degrees of visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (an average improvement of 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and improved test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.