The reward function is essential in reinforcement learning (RL), serving as the guiding signal that incentivizes agents to solve given tasks; however, it is also notoriously difficult to design. In many cases, only imperfect rewards are available, which can inflict substantial performance loss on RL agents. In this study, we propose a unified offline policy optimization approach, \textit{RGM (Reward Gap Minimization)}, which can smartly handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data; the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sample-based learning without any online interactions. Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Furthermore, RGM can effectively correct rewards that are wrong or inconsistent with expert preference, and retrieve useful information from biased rewards.
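To make the bi-level structure concrete, a schematic form is sketched below. This is an illustrative outline rather than the paper's exact formulation: the symbols $\hat r$ (imperfect reward), $\Delta r$ (learned correction), $d^\pi$ (visitation distribution of policy $\pi$), $d^E$ (expert visitation distribution), the divergence $D$, and the pessimism regularizer $\Omega$ are all notational assumptions introduced here for exposition.
\[
\min_{\Delta r}\; D\!\left(d^{\pi^*(\Delta r)} \,\middle\|\, d^{E}\right)
\quad \text{s.t.} \quad
\pi^*(\Delta r) \in \arg\max_{\pi}\;
\mathbb{E}_{(s,a)\sim d^{\pi}}\!\left[\hat r(s,a) + \Delta r(s,a)\right] - \Omega(\pi),
\]
Under this reading, the upper level matches the induced visitation distribution to the expert data by adjusting $\Delta r$, while the lower level performs pessimistic (offline) policy optimization under the corrected reward $\hat r + \Delta r$; the tractable algorithm mentioned above would then come from dualizing the lower-level problem.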