An emerging field of sequential decision-making is safe Reinforcement Learning (RL), where the objective is to maximize the reward while obeying safety constraints. Being able to handle constraints is essential for deploying RL agents in real-world environments, where constraint violations can harm the agent and the environment. To this end, we propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic. The safety critic predicts the probability of constraint violation and discounts the reward critic, which only estimates constraint-free returns. By splitting responsibilities, we simplify the learning task, leading to increased sample efficiency. We integrate our approach into two popular RL algorithms, Proximal Policy Optimization and Soft Actor-Critic, and evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations. Finally, we demonstrate zero-shot sim-to-real transfer, where a differential drive robot has to navigate through a cluttered room. Our code can be found at https://github.com/nikeke19/Safe-Mult-RL.
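To make the multiplicative construction concrete, below is a minimal PyTorch sketch of a value function in which a safety critic (outputting the probability of remaining constraint-free) discounts a reward critic that estimates constraint-free returns. Class names, network sizes, and the observation dimension are illustrative assumptions, not the API of the released repository.

```python
import torch
import torch.nn as nn


class MultiplicativeValueFunction(nn.Module):
    """Illustrative sketch: V(s) = P(safe | s) * V_reward(s).

    The reward critic only estimates constraint-free returns, while the
    safety critic estimates the probability of avoiding a constraint
    violation; their product forms the multiplicative value.
    """

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        # Reward critic: return estimate assuming no constraint violation.
        self.reward_critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        # Safety critic: probability of staying safe, squashed to [0, 1].
        self.safety_critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        v_reward = self.reward_critic(obs)   # constraint-free return estimate
        p_safe = self.safety_critic(obs)     # probability of no violation
        return p_safe * v_reward             # safety critic discounts the reward critic


if __name__ == "__main__":
    obs = torch.randn(8, 4)                  # batch of dummy observations
    value_fn = MultiplicativeValueFunction(obs_dim=4)
    print(value_fn(obs).shape)               # torch.Size([8, 1])
```

In an actor-critic setup such as PPO or SAC, the two critics would be trained with separate targets (a return target for the reward critic, a violation-probability target for the safety critic), and the policy would be updated against the multiplicative value; those training details are beyond this sketch.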