When learning common skills like driving, beginners usually have domain experts standing by to ensure the safety of the learning process. We formulate such a learning scheme as Expert-in-the-loop Reinforcement Learning (ERL), where a guardian is introduced to safeguard the exploration of the learning agent. While allowing sufficient exploration in the uncertain environment, the guardian intervenes in dangerous situations and demonstrates the correct actions to avoid potential accidents. ERL thus enables both exploration and the expert's partial demonstrations as two training sources. Following this setting, we develop a novel Expert Guided Policy Optimization (EGPO) method that integrates the guardian into the loop of reinforcement learning. The guardian is composed of an expert policy to generate demonstrations and a switch function to decide when to intervene. In particular, a constrained optimization technique is used to tackle the trivial solution in which the agent deliberately behaves dangerously to deceive the expert into taking over. An offline RL technique is further used to learn from the partial demonstrations generated by the expert. Safe driving experiments show that our method achieves superior training- and test-time safety, outperforms baselines by a substantial margin in sample efficiency, and preserves generalizability to unseen environments at test time. Demo video and source code are available at: https://decisionforce.github.io/EGPO/
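To make the guardian mechanism described above concrete, the following is a minimal sketch (not the authors' implementation) of a rollout loop in which a switch function decides when the expert takes over; the policy, expert, switch-function condition, and environment interface are all assumed placeholders.

```python
def rollout_with_guardian(env, agent_policy, expert_policy, switch_fn, horizon=1000):
    """Collect one episode in which the guardian may override unsafe actions."""
    obs = env.reset()
    trajectory = []  # stores (obs, applied_action, reward, takeover_flag)
    for _ in range(horizon):
        agent_action = agent_policy(obs)
        expert_action = expert_policy(obs)

        # The switch function flags dangerous situations; its exact form
        # (e.g., comparing the agent's action against the expert's, or
        # thresholding a risk estimate) is an assumption in this sketch.
        takeover = switch_fn(obs, agent_action, expert_action)

        # When the guardian intervenes, the expert's action is executed and
        # logged as a partial demonstration for offline-RL-style learning.
        applied_action = expert_action if takeover else agent_action

        next_obs, reward, done, info = env.step(applied_action)
        trajectory.append((obs, applied_action, reward, takeover))

        obs = next_obs
        if done:
            break
    return trajectory
```

Trajectories collected this way contain both free exploration steps and expert take-over steps, matching the two training sources described in the abstract.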