Learning a risk-aware policy is essential but particularly challenging in unstructured robotic tasks. Safe reinforcement learning methods open up new possibilities for tackling this problem. However, their conservative policy updates make it difficult to achieve sufficient exploration and desirable performance in complex, sample-expensive environments. In this paper, we propose a dual-agent safe reinforcement learning strategy consisting of a baseline agent and a safe agent. This decoupled framework enables high flexibility, data efficiency, and risk-awareness for RL-based control. Concretely, the baseline agent maximizes reward under the standard RL setting and is therefore compatible with off-the-shelf training techniques for unconstrained optimization, exploration, and exploitation. The safe agent, in turn, mimics the baseline agent for policy improvement and learns to satisfy safety constraints via off-policy RL tuning. Compared with training from scratch, such safe policy correction requires significantly fewer interactions to obtain a near-optimal policy. The dual policies can be optimized synchronously via a shared replay buffer, or by using a pre-trained model or a non-learning-based controller as a fixed baseline agent. Experimental results show that our approach can learn feasible skills without prior knowledge, as well as derive risk-averse counterparts from pre-trained unsafe policies. The proposed method outperforms state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks with respect to both safety constraint satisfaction and sample efficiency.
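The sketch below illustrates the data flow implied by the abstract: a baseline agent that collects experience and optimizes reward only, and a safe agent that samples the same shared replay buffer, imitates the baseline, and additionally penalizes constraint costs. All names (SharedReplayBuffer, BaselineAgent, SafeAgent, imitation_coef, cost_coef) and the penalty-style safe objective are illustrative assumptions, not the paper's actual implementation; the environment is assumed to expose a Gym-style step that also returns a safety cost.

```python
# Structural sketch of a dual-agent safe RL loop (assumptions noted above).
import random
from collections import deque


class SharedReplayBuffer:
    """Single buffer feeding both agents with (s, a, r, cost, s', done) tuples."""

    def __init__(self, capacity=100_000):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))


class BaselineAgent:
    """Standard off-policy RL agent: maximizes expected reward only."""

    def act(self, state):
        return 0.0  # placeholder action; a real agent would query its policy

    def update(self, batch):
        # Placeholder for an unconstrained actor-critic step on reward alone.
        pass


class SafeAgent:
    """Mimics the baseline policy while learning to satisfy safety constraints."""

    def __init__(self, imitation_coef=1.0, cost_coef=10.0):
        self.imitation_coef = imitation_coef  # weight on imitating the baseline
        self.cost_coef = cost_coef            # weight on the constraint penalty

    def update(self, batch, baseline):
        # Placeholder for an objective of the form
        #   imitation_coef * ||pi_safe(s) - pi_base(s)||^2
        #   + cost_coef * (estimated constraint violation),
        # i.e. imitation of the baseline plus a safety penalty (an assumption;
        # the paper's actual constrained objective may differ).
        pass


def train(env, steps=10_000, batch_size=256):
    buffer = SharedReplayBuffer()
    baseline, safe = BaselineAgent(), SafeAgent()
    state = env.reset()
    for _ in range(steps):
        action = baseline.act(state)              # baseline drives exploration
        next_state, reward, cost, done = env.step(action)
        buffer.add((state, action, reward, cost, next_state, done))
        state = env.reset() if done else next_state

        batch = buffer.sample(batch_size)
        baseline.update(batch)                    # reward-only objective
        safe.update(batch, baseline)              # imitation + constraint objective
    return safe                                   # deploy the risk-aware policy
```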