Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models achieve strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning is invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves over the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.