In this work, we study the simple yet universally applicable case of reward shaping in value-based Deep Reinforcement Learning (DRL). We show that reward shifting, in the form of a linear transformation of the reward, is equivalent to changing the initialization of the $Q$-function under function approximation. Based on this equivalence, we bring the key insight that a positive reward shift leads to conservative exploitation, while a negative reward shift leads to curiosity-driven exploration. Accordingly, conservative exploitation improves offline RL value estimation, and optimistic value estimation improves exploration for online RL. We validate this insight on a range of RL tasks and show improvements over baselines: (1) in offline RL, conservative exploitation improves the performance of off-the-shelf algorithms; (2) in online continuous control, multiple value functions with different shifting constants can be used to tackle the exploration-exploitation dilemma for better sample efficiency; (3) in discrete control tasks, a negative reward shift yields an improvement over a curiosity-based exploration baseline.
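To make the claimed equivalence concrete, the following is a minimal derivation sketch (notation assumed here: discount factor $\gamma$, shifting constant $b$, per-step reward $r_t$). Shifting every reward by a constant $b$ shifts the $Q$-value of any policy $\pi$ by the same amount in every state-action pair:
\[
Q^{\pi}_{r+b}(s,a)
= \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,(r_t + b)\;\middle|\; s_0=s,\, a_0=a\right]
= Q^{\pi}_{r}(s,a) + \frac{b}{1-\gamma}.
\]
Hence a $Q$-network initialized near zero and trained on the shifted reward behaves, with respect to the original problem, like a $Q$-function initialized at $-b/(1-\gamma)$: pessimistic (conservative) for $b>0$ and optimistic (exploration-inducing) for $b<0$.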