Reinforcement Learning has long suffered from poor reward specification and is prone to reward hacking even in relatively simple domains. Preference-Based Reinforcement Learning addresses this issue by learning a reward model from binary feedback provided by a human in the loop, who indicates which of two queried trajectories better reflects the desired agent behavior. In this work, we present a state augmentation technique that makes the agent's reward model robust by enforcing an invariance consistency, which significantly improves performance, i.e., the reward recovery and the subsequent return computed using the learned policy, over our baseline PEBBLE. We validate our method on three domains, Mountain Car, the locomotion task Quadruped-Walk, and the robotic manipulation task Sweep-Into, and find that with the proposed augmentation the agent not only improves its overall performance but does so early in its training phase.
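For concreteness, a minimal sketch of the kind of objective this describes is given below. The Bradley-Terry preference model over trajectory segments is the standard choice in PEBBLE-style PbRL; the augmentation map aug(·) and the consistency weight λ are illustrative assumptions, not the paper's exact formulation.

```latex
% Bradley-Terry preference model over trajectory segments sigma^0, sigma^1,
% as used in PEBBLE; \hat{r}_\psi denotes the learned reward model.
P_\psi\!\left[\sigma^1 \succ \sigma^0\right] =
  \frac{\exp\!\big(\sum_t \hat{r}_\psi(s^1_t, a^1_t)\big)}
       {\sum_{i \in \{0,1\}} \exp\!\big(\sum_t \hat{r}_\psi(s^i_t, a^i_t)\big)}

% Illustrative overall loss: the usual cross-entropy over human preference
% labels plus an invariance-consistency term on augmented states
% (aug(\cdot) and \lambda are assumptions for this sketch).
\mathcal{L}(\psi) = \mathcal{L}_{\mathrm{CE}}(\psi)
  + \lambda \, \mathbb{E}_{(s,a)}\!\left[\big(\hat{r}_\psi(s, a) - \hat{r}_\psi(\mathrm{aug}(s), a)\big)^2\right]
```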