Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Large language models (LLMs) excel at tasks such as question answering and dialogue, but complex interactive tasks, such as negotiation and persuasion, demand additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can in principle enable such planning, but it suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when the policies being trained are LLMs. Furthermore, the largest LLMs do not expose the APIs needed to train them in this manner. As a result, modern methods for improving the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, and plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, which keeps them a concise, lightweight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
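
The abstract does not spell out how the value estimates are consumed, so the following is only a minimal sketch of one plausible reading: candidate reasoning steps are scored by a goal-conditioned value function against positive and negative outcomes, and the agent prefers the step with the best predicted balance. The names `score_thoughts`, `toy_value_fn`, and `TOY_VALUES`, the "gain minus risk" scoring rule, and the negotiation example values are all assumptions made for illustration; a fixed lookup table stands in for a learned value function.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (state, candidate_thought, goal) -> estimated
# probability in [0, 1] that the goal is eventually reached.
ValueFn = Callable[[str, str, str], float]


def score_thoughts(
    value_fn: ValueFn,
    state: str,
    thoughts: List[str],
    positive_goals: List[str],
    negative_goals: List[str],
) -> List[Tuple[str, float]]:
    """Rank candidate reasoning steps by their predicted outcomes.

    The score (value of positive goals minus value of negative goals)
    is an illustrative assumption, not the scoring rule from the paper.
    """
    scored = []
    for thought in thoughts:
        gain = sum(value_fn(state, thought, g) for g in positive_goals)
        risk = sum(value_fn(state, thought, g) for g in negative_goals)
        scored.append((thought, gain - risk))
    # Highest score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for a learned value function: a fixed lookup table
# keyed by (thought, goal). The state is ignored in this toy version.
TOY_VALUES: Dict[Tuple[str, str], float] = {
    ("offer a discount", "buyer accepts"): 0.75,
    ("offer a discount", "buyer walks away"): 0.25,
    ("hold the price", "buyer accepts"): 0.5,
    ("hold the price", "buyer walks away"): 0.5,
}


def toy_value_fn(state: str, thought: str, goal: str) -> float:
    return TOY_VALUES.get((thought, goal), 0.0)


ranking = score_thoughts(
    toy_value_fn,
    state="buyer asked for a lower price",
    thoughts=["hold the price", "offer a discount"],
    positive_goals=["buyer accepts"],
    negative_goals=["buyer walks away"],
)
best_thought = ranking[0][0]
```

Because scoring happens over short reasoning steps and only calls the value function, a sketch like this needs no gradient access to the LLM itself, which is consistent with the abstract's claim that the approach applies to API-based models.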
