While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance on complex, long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision on vision-centric tasks. We conduct a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm and revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models, including GPT-5 (+314.2\%) and Gemini-3-Flash (+111.6\%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0\%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.
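To make the "native generative image-to-image" framing concrete, the following toy PyTorch sketch illustrates one way such a reasoner could operate: a denoiser conditioned on the rendered problem image ancestrally samples a solution image via DDPM-style diffusion. This is a minimal illustrative sketch, not the paper's DiffThinker implementation; the module names, noise schedule, and image shapes are all assumptions.

```python
# Illustrative sketch only (not the authors' implementation): vision-centric
# reasoning cast as conditional image-to-image diffusion.
import torch
import torch.nn as nn


class TinyDenoiser(nn.Module):
    """Toy noise predictor conditioned on the problem image (timestep ignored here;
    a real model would embed it)."""
    def __init__(self, channels: int = 3):
        super().__init__()
        # Concatenate the noisy solution image with the conditioning problem image.
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x_t, cond, t):
        return self.net(torch.cat([x_t, cond], dim=1))


@torch.no_grad()
def reason_by_diffusion(problem_img, model, steps: int = 50):
    """Sample a 'solution image' from pure noise, conditioned on the problem image,
    using standard DDPM ancestral sampling."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(problem_img)  # start from Gaussian noise
    for t in reversed(range(steps)):
        eps = model(x, problem_img, t)  # predicted noise given the condition
        mean = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # predicted solution image


if __name__ == "__main__":
    problem = torch.randn(1, 3, 64, 64)  # stand-in for a rendered puzzle/board image
    solution = reason_by_diffusion(problem, TinyDenoiser())
    print(solution.shape)  # torch.Size([1, 3, 64, 64])
```

In this framing, the "reasoning" is carried entirely by the image-space denoising trajectory rather than by a text chain of thought, which is what the paradigm's claimed efficiency and native parallelism refer to.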