Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense, verifiable rewards derived from expert trajectories. Our theoretical analysis shows that improving single-turn task reasoning with GRPO yields a lower bound on the multi-turn success probability under the minimal number of turns, and that this improvement generalizes to subtasks with shorter horizons. Experimental evaluation on a complex task planning benchmark demonstrates that our 1.5B-parameter model trained with single-turn GRPO outperforms larger baseline models of up to 14B parameters, reaching a 70% success rate on long-horizon planning tasks.
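To make the core training signal concrete, the sketch below illustrates, under stated assumptions, the two ingredients named in the abstract: a dense, verifiable reward computed by checking a generated plan against an expert trajectory, and GRPO's group-relative advantage normalization over a group of sampled completions. The prefix-matching reward, the action-step strings, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's reward
    by the mean and std of its own group (no learned value function)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def dense_plan_reward(predicted_steps, expert_steps):
    """Hypothetical dense, verifiable reward: fraction of the expert
    trajectory reproduced as a correct prefix by the predicted plan."""
    matched = 0
    for pred, gold in zip(predicted_steps, expert_steps):
        if pred != gold:
            break  # reward only the longest correct prefix
        matched += 1
    return matched / max(len(expert_steps), 1)

# Example: a group of 4 single-turn plans scored against one expert trajectory.
expert = ["goto(kitchen)", "pick(cup)", "goto(table)", "place(cup)"]
group = [
    ["goto(kitchen)", "pick(cup)", "goto(table)", "place(cup)"],
    ["goto(kitchen)", "pick(cup)", "place(cup)"],
    ["goto(table)", "pick(cup)"],
    ["goto(kitchen)", "pick(plate)"],
]
rewards = [dense_plan_reward(p, expert) for p in group]
print(rewards)                          # [1.0, 0.5, 0.0, 0.25]
print(group_relative_advantages(rewards))
```

Because every sampled plan in the group can be scored against the expert trajectory, the reward is dense rather than episode-wise, which is what makes single-turn GRPO optimization tractable in this setting.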