Spatial reasoning in large language models (LLMs) has gained increasing attention due to applications in navigation and planning. Despite strong general language capabilities, LLMs still struggle with spatial transformations and multi-step planning in structured environments. We propose a two-stage approach that decomposes spatial reasoning into atomic building blocks and their composition. First, we apply supervised fine-tuning on elementary spatial transformations, such as rotation, translation, and scaling, to equip the model with basic spatial physics. We then freeze this physics-aware model and train lightweight LoRA adapters within the GRPO framework to learn policies that compose these building blocks for multi-step, closed-loop planning in puzzle-based environments. To support this pipeline, we synthesize an ASCII-art dataset and construct a corresponding ASCII-based reinforcement learning environment. Our method consistently outperforms baselines, including the generic backbone, the physics-aware model, and end-to-end RL models, in both Dynamic environments, where state updates are explicit, and Static environments, where the model must rely on its internal state across steps. In addition, the proposed approach converges faster and trains more stably than end-to-end reinforcement learning from scratch. Finally, we analyze attention patterns to assess whether fine-tuning induces meaningful improvements in spatial understanding.
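To make the atomic building blocks concrete, the sketch below implements rotation, translation, and scaling on a small ASCII grid in plain Python. The grid representation (a list of equal-length strings), the `.` padding character, and the integer scaling scheme are illustrative assumptions, not the paper's exact data format.

```python
# A minimal sketch of the three atomic transformations on an ASCII grid.
# The grid is a list of equal-length strings; '.' marks empty cells.
# These encoding choices are assumptions for illustration only.

def rotate90(grid: list[str]) -> list[str]:
    """Rotate the grid 90 degrees clockwise."""
    return ["".join(row) for row in zip(*reversed(grid))]

def translate(grid: list[str], dx: int, dy: int, fill: str = ".") -> list[str]:
    """Shift content right by dx and down by dy (non-negative shifts only);
    cells pushed past the grid boundary are dropped."""
    w = len(grid[0])
    shifted = [(fill * dx + row)[:w] for row in grid]   # horizontal shift
    return ([fill * w] * dy + shifted)[: len(grid)]     # vertical shift

def scale(grid: list[str], k: int) -> list[str]:
    """Upscale by integer factor k: each cell becomes a k-by-k block."""
    rows = ["".join(ch * k for ch in row) for row in grid]
    return [row for row in rows for _ in range(k)]

if __name__ == "__main__":
    shape = ["#.", "##"]
    print("\n".join(rotate90(shape)))        # -> '##' / '#.'
    print("\n".join(translate(shape, 0, 1))) # -> '..' / '#.'
    print("\n".join(scale(shape, 2)))        # -> 2x upscaled shape
```

Supervised fine-tuning on input/output pairs produced by such primitives is what equips the model with the "basic spatial physics" referenced above.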
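For the second stage, a plausible implementation path uses Hugging Face PEFT and TRL: the physics-aware backbone stays frozen while only LoRA adapter weights are updated under GRPO. The checkpoint name, LoRA hyperparameters, prompt layout, and toy reward function below are all assumptions for illustration; the paper's actual environment reward and configuration may differ.

```python
# A minimal sketch of stage 2: LoRA adapters trained with GRPO on top of a
# frozen, physics-aware backbone, built with Hugging Face TRL + PEFT.
# Checkpoint path, hyperparameters, and the toy reward are assumptions.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Hypothetical stand-in for the ASCII environment's terminal reward:
# score each sampled plan (completion) against a target grid that we
# assume is encoded in the prompt after a "TARGET:" tag.
def ascii_env_reward(prompts, completions, **kwargs):
    rewards = []
    for prompt, plan in zip(prompts, completions):
        target = prompt.split("TARGET:\n", 1)[-1]  # assumed prompt layout
        rewards.append(1.0 if target.strip() in plan else 0.0)
    return rewards

# Tiny placeholder dataset; real training would use the synthesized puzzles.
train_dataset = Dataset.from_list(
    [{"prompt": "Compose moves to reach\nTARGET:\n##\n#."}] * 64
)

peft_config = LoraConfig(        # only these adapter weights are trained;
    r=16, lora_alpha=32,         # the SFT'd backbone remains frozen
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

args = GRPOConfig(
    output_dir="grpo-ascii-planner",
    num_generations=8,           # group size for GRPO's relative advantages
    max_completion_length=256,
    learning_rate=1e-5,
)

trainer = GRPOTrainer(
    model="physics-aware-sft-checkpoint",  # assumed path to the stage-1 model
    reward_funcs=ascii_env_reward,
    args=args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```

Restricting updates to the adapters is consistent with the stability and convergence advantages the abstract reports over end-to-end RL from scratch, since the frozen backbone preserves the spatial priors learned in stage 1.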