Expanding the Action Space of LLMs to Reason Beyond Language

Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments -- such as symbolic operators or simulators -- must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
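
To make the routing mechanism concrete, the following is a minimal sketch of an expanded action space with one external environment. It is an illustration only, not the ExpA or EARL implementation: the action names (`ROUTE_TO_CALC`, `CALC_ACTIONS`), the toy vocabulary, the `Episode` container, and the left-to-right calculator are all assumptions introduced here. The sketch shows only the control flow described above: language tokens and routing actions are available in the language environment, only environment-specific actions are available after routing, and the environment's feedback returns control to language.

```python
import re
from dataclasses import dataclass, field

LANGUAGE = "language"
CALC = "calculator"

# Hypothetical action sets; the abstract does not specify the real inventory.
VOCAB = {"so", "the", "answer", "is"}            # stand-in for vocabulary tokens
ROUTE_TO_CALC = "<route:calculator>"             # routing action outside the vocabulary
CALC_ACTIONS = set("0123456789") | {"+", "-", "*", "="}


@dataclass
class Episode:
    env: str = LANGUAGE                          # reasoning starts in language
    buffer: str = ""                             # keys pressed in the calculator
    trace: list = field(default_factory=list)    # actions and observations so far


def allowed_actions(ep: Episode) -> set:
    """Actions available in the current environment (an action mask)."""
    if ep.env == LANGUAGE:
        return VOCAB | {ROUTE_TO_CALC}
    return CALC_ACTIONS


def evaluate(expr: str) -> int:
    """Toy calculator: evaluates strictly left to right, ignoring precedence."""
    if not re.fullmatch(r"\d+([+\-*]\d+)*", expr):
        raise ValueError(f"malformed expression: {expr!r}")
    tokens = re.findall(r"\d+|[+\-*]", expr)
    total = int(tokens[0])
    for op, num in zip(tokens[1::2], tokens[2::2]):
        n = int(num)
        total = total + n if op == "+" else total - n if op == "-" else total * n
    return total


def step(ep: Episode, action: str):
    """Apply one action; return the environment's feedback, or None."""
    if action not in allowed_actions(ep):
        raise ValueError(f"{action!r} is not available in {ep.env}")
    ep.trace.append(action)
    if ep.env == LANGUAGE:
        if action == ROUTE_TO_CALC:              # switch to the external environment
            ep.env, ep.buffer = CALC, ""
        return None
    if action == "=":                            # environment responds, then routes back
        result = evaluate(ep.buffer)
        ep.trace.append(f"<obs:{result}>")
        ep.env = LANGUAGE
        return result
    ep.buffer += action
    return None
```

For example, the action sequence `"so"`, `ROUTE_TO_CALC`, `"1"`, `"2"`, `"*"`, `"3"`, `"="` leaves the episode back in the language environment with the calculator's result recorded in `trace`. No text parser is involved, because the calculator keys are actions in their own right rather than strings to be extracted from generated text.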
