Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges at inference time in large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference so as to self-improve on a given task. After each response, the model receives numerical scalar feedback, which we refer to as a reward. In the next round, we prompt the LLM again, this time with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize a scalar reward signal during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.
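To make the described procedure concrete, the following is a minimal sketch of the ICRL prompting loop, assuming a generic LLM call and a scalar reward source; `query_llm`, `score_response`, and the prompt wording are hypothetical placeholders, not the authors' implementation.

```python
# A minimal sketch of the ICRL prompting loop described in the abstract.
# `query_llm` and `score_response` are hypothetical stand-ins for an LLM call
# and the task's reward signal (which, per the abstract, may itself be an LLM).

def icrl_prompting(task: str, query_llm, score_response, num_rounds: int = 5) -> str:
    history = []  # (response, reward) pairs accumulated across rounds
    best_response, best_reward = None, float("-inf")

    for _ in range(num_rounds):
        # Concatenate all prior responses and their rewards into the prompt
        # so the model can condition on scalar feedback from earlier rounds.
        context = "\n".join(
            f"Previous attempt:\n{resp}\nReward: {rew}" for resp, rew in history
        )
        prompt = (
            f"{task}\n\n{context}\n\n"
            "Produce a new response that achieves a higher reward."
        )

        response = query_llm(prompt)             # one inference call, no weight updates
        reward = score_response(task, response)  # scalar feedback for this round
        history.append((response, reward))

        if reward > best_reward:
            best_response, best_reward = response, reward

    return best_response
```

The loop performs no parameter updates; any improvement comes purely from conditioning on the growing context of responses and rewards, which is what the abstract characterizes as in-context RL.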