Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite these encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, due to its limited capabilities, the initial model often struggles to generate responses that satisfy all constraints, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight Instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction-following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints they satisfy in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction and response levels, which enables efficient optimization using only a binary reward signal. Extensive experiments demonstrate that HiR yields promising results across different instruction-following tasks while requiring a smaller computational budget. Our code and dataset are available at https://github.com/sastpg/HIR.
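To make the select-then-rewrite idea concrete, the sketch below relabels a failed attempt against only the constraints it happens to satisfy, producing a replayed (instruction, response) pair with a positive binary reward. This is a minimal illustration under assumed interfaces; the names `Constraint`, `rewrite_instruction`, and `hindsight_replay` are hypothetical and not taken from the paper's implementation.

```python
# Minimal sketch of hindsight instruction replay (select-then-rewrite).
# All helper names here are illustrative assumptions, not the authors' code.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Constraint:
    text: str                     # natural-language constraint, e.g. "answer in one sentence"
    check: Callable[[str], bool]  # verifier: True if the response satisfies the constraint


def rewrite_instruction(base_task: str, constraints: List[Constraint]) -> str:
    """Compose an instruction from a base task and a (possibly reduced) constraint set."""
    if not constraints:
        return base_task
    return base_task + " Constraints: " + "; ".join(c.text for c in constraints) + "."


def hindsight_replay(base_task: str,
                     constraints: List[Constraint],
                     response: str) -> Tuple[str, str, float]:
    """Select the constraints the response satisfies, then rewrite the instruction
    so the failed attempt becomes a successful pair with a binary reward."""
    satisfied = [c for c in constraints if c.check(response)]  # select
    relabeled = rewrite_instruction(base_task, satisfied)      # rewrite
    reward = 1.0 if satisfied else 0.0                         # binary reward in hindsight
    return relabeled, response, reward


if __name__ == "__main__":
    constraints = [
        Constraint("answer in exactly one sentence", lambda r: r.count(".") == 1),
        Constraint("mention the word 'hindsight'", lambda r: "hindsight" in r.lower()),
    ]
    response = "Hindsight replay turns failures into training signal. It is sample efficient."
    # The response violates the one-sentence constraint but satisfies the keyword
    # constraint, so it is replayed as a success for the relaxed instruction.
    print(hindsight_replay("Explain hindsight replay.", constraints, response))
```

The replayed pairs produced this way, together with the original samples, would then feed the dual-preference RL objective described above.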