Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Human-in-the-loop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample- and feedback-efficient. We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning. Specifically, we learn a reward model by actively querying a teacher's preferences between two clips of behavior and use it to train an agent. To enable off-policy learning, we relabel all the agent's past experience when its reward model changes. We additionally show that pre-training our agent with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard reward functions.
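To make the two mechanisms named in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it shows (1) a Bradley-Terry style probability that a teacher prefers one behavior clip over another, computed from summed rewards under a learned reward model, and (2) relabeling stored transitions with the current reward model so an off-policy agent can reuse old experience after the model changes. The linear reward model `r_hat` and its parameters `theta` are hypothetical stand-ins introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, clip_len = 4, 50

# Hypothetical learned reward model: here just a linear function of the observation.
theta = rng.normal(size=obs_dim)

def r_hat(obs):
    """Predicted reward for observations of shape (..., obs_dim)."""
    return obs @ theta

def preference_prob(clip0, clip1):
    """P[clip1 preferred over clip0] under a Bradley-Terry model on summed predicted rewards."""
    s0, s1 = r_hat(clip0).sum(), r_hat(clip1).sum()
    return 1.0 / (1.0 + np.exp(s0 - s1))  # softmax over the two return estimates

def relabel(replay_obs):
    """Recompute rewards for all stored transitions with the current reward model."""
    return r_hat(replay_obs)

# Two candidate clips shown to the teacher (random placeholders here).
clip0 = rng.normal(size=(clip_len, obs_dim))
clip1 = rng.normal(size=(clip_len, obs_dim))
print("P[teacher prefers clip 1]:", preference_prob(clip0, clip1))

# Stand-in for a replay buffer; relabeling would run whenever theta is updated.
replay_obs = rng.normal(size=(1000, obs_dim))
print("relabeled rewards shape:", relabel(replay_obs).shape)
```

In an actual system the reward model would be a neural network trained with a cross-entropy loss on the teacher's preference labels, and the relabeled rewards would feed an off-policy RL update; the sketch only fixes the interfaces implied by the abstract.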