The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., a Meta-Reward) by composing structured atomic primitives, which in turn guides the training of an inner-loop policy. Crucially, unlike prior evolutionary approaches, DERL is differentiable in its meta-optimization: it treats inner-loop validation performance as a signal for updating the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods that rely on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
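To make the bilevel structure described above concrete, the following is a minimal, self-contained Python sketch of such a loop: an outer Meta-Optimizer proposes a weighting over atomic reward primitives, an inner loop trains a toy policy under the composed reward, and the validation score feeds back as the Meta-Optimizer's learning signal. All names, the toy task, and the update rules here are illustrative assumptions, not the paper's implementation.

```python
"""Minimal sketch of a DERL-style bilevel loop (illustrative assumptions only)."""
import random

# Hypothetical atomic reward primitives: each maps a trajectory summary to a scalar.
PRIMITIVES = {
    "progress": lambda traj: traj["progress"],
    "step_penalty": lambda traj: -0.01 * traj["steps"],
    "success_bonus": lambda traj: 1.0 if traj["success"] else 0.0,
}

class MetaOptimizer:
    """Keeps a preference per primitive; updated by the outer-loop validation signal."""
    def __init__(self, names):
        self.prefs = {n: 0.0 for n in names}

    def sample_weights(self):
        # Perturb current preferences to propose a candidate Meta-Reward.
        return {n: p + random.gauss(0, 0.5) for n, p in self.prefs.items()}

    def update(self, weights, signal, lr=0.3):
        # Move preferences toward weightings that yielded higher validation performance.
        for n in self.prefs:
            self.prefs[n] += lr * signal * (weights[n] - self.prefs[n])

def meta_reward(weights, traj):
    """Compose primitives into a single scalar reward."""
    return sum(w * PRIMITIVES[n](traj) for n, w in weights.items())

def train_and_validate(weights, episodes=200):
    """Toy inner loop: a 1-D threshold policy nudged by the composed reward.
    Returns the validation success rate, used as the outer-loop signal."""
    threshold = 0.0
    for _ in range(episodes):
        effort = threshold + random.gauss(0, 0.3)
        traj = {"progress": effort,
                "steps": max(1, int(10 - 5 * effort)),
                "success": effort > 0.8}
        r = meta_reward(weights, traj)
        threshold += 0.05 * r * (effort - threshold)  # crude policy-improvement nudge
    return sum((threshold + random.gauss(0, 0.3)) > 0.8 for _ in range(100)) / 100.0

if __name__ == "__main__":
    meta = MetaOptimizer(PRIMITIVES.keys())
    for gen in range(20):
        w = meta.sample_weights()        # outer step: propose a candidate Meta-Reward
        score = train_and_validate(w)    # inner loop: train policy, measure validation success
        meta.update(w, score)            # validation score is the Meta-Optimizer's signal
        print(f"gen {gen:02d}  val_success={score:.2f}")
```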