Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with environments and solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, which allows the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% gains on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer also generalizes better to more challenging or previously unseen tasks than RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to inducing exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
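To make the second component concrete, the sketch below illustrates one plausible test-time loop consistent with the abstract: the agent runs several episodes on the same task and, after each unsuccessful episode, appends a self-generated reflection to its context, adapting in-context with no gradient update. This is a minimal sketch under our own assumptions; the names `ToyEnv`, `llm_act`, and `llm_reflect` are hypothetical placeholders, not LaMer's actual interface.

```python
from typing import List, Tuple


class ToyEnv:
    """Toy stand-in for a multi-turn environment (e.g. Sokoban, MineSweeper, Webshop)."""

    def reset(self) -> str:
        return "initial observation"

    def step(self, action: str) -> Tuple[str, float, bool]:
        # Returns (observation, reward, done); a real environment would score the action.
        return "next observation", 0.0, False


def llm_act(context: List[str], observation: str) -> str:
    # Placeholder: query the LLM policy for the next action, conditioned on
    # accumulated reflections plus the current observation.
    return "noop"


def llm_reflect(context: List[str], trajectory: List[str], total_reward: float) -> str:
    # Placeholder: ask the LLM to summarize what went wrong and what to try next episode.
    return f"Episode failed with reward {total_reward}; try a different strategy."


def lamer_test_time(env: ToyEnv, max_episodes: int = 3, max_steps: int = 50) -> float:
    """Cross-episode test-time loop: reflections accumulate in context across episodes."""
    context: List[str] = []  # in-context memory; no model parameters are updated
    best_reward = float("-inf")
    for _ in range(max_episodes):
        obs = env.reset()
        trajectory: List[str] = []
        total_reward, done = 0.0, False
        for _ in range(max_steps):
            action = llm_act(context, obs)
            obs, reward, done = env.step(action)
            trajectory.append(f"action={action} obs={obs} reward={reward}")
            total_reward += reward
            if done:
                break
        best_reward = max(best_reward, total_reward)
        if done and total_reward > 0:
            break  # task solved; stop retrying
        # In-context policy adaptation: append a textual reflection instead of a gradient step.
        context.append(llm_reflect(context, trajectory, total_reward))
    return best_reward


if __name__ == "__main__":
    print(lamer_test_time(ToyEnv()))
```

The cross-episode training component (i) would additionally optimize the policy over such multi-episode rollouts during training, so that long-term reward across episodes, rather than single-episode return, drives the learned exploration behavior.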