This paper studies the constrained/safe reinforcement learning (RL) problem with sparse indicator signals for constraint violations. We propose a model-based approach that enables RL agents to effectively explore environments with unknown system dynamics and environment constraints, given only a small budget of constraint violations. We employ an ensemble of neural networks to estimate the prediction uncertainty and use model predictive control (MPC) as the basic control framework. We propose the robust cross-entropy method to optimize the control sequence while accounting for model uncertainty and constraints. We evaluate our method in the Safety Gym environment. The results show that our approach learns to complete the tasks with far fewer constraint violations than state-of-the-art baselines. Additionally, it achieves several orders of magnitude better sample efficiency than constrained model-free RL approaches. The code is available at \url{https://github.com/liuzuxin/safe-mbrl}.
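To make the planning loop concrete, below is a minimal sketch of how a robust cross-entropy optimizer inside an MPC loop might look. This is an illustrative assumption, not the paper's implementation: the `ensemble` objects with a one-step `predict` method, the vectorized `reward_fn` and `cost_fn`, and all hyperparameter values are hypothetical placeholders. The key idea it demonstrates is treating a candidate action sequence as feasible only if it violates no constraint under any model in the ensemble, then selecting elites by mean predicted reward.

```python
import numpy as np

def robust_cem(ensemble, reward_fn, cost_fn, state, horizon=10, act_dim=2,
               pop_size=400, n_elites=40, n_iters=5):
    """Sketch of robust CEM planning (hypothetical interfaces):
    returns the first action of the best feasible action sequence."""
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        cand = mu + sigma * np.random.randn(pop_size, horizon, act_dim)
        rewards = np.zeros(pop_size)
        feasible = np.ones(pop_size, dtype=bool)
        for m in ensemble:  # one rollout per ensemble member
            s = np.repeat(state[None], pop_size, axis=0)
            r = np.zeros(pop_size)
            for t in range(horizon):
                s = m.predict(s, cand[:, t])  # assumed one-step model API
                r += reward_fn(s)             # assumed vectorized reward
                # Robust feasibility: a violation under ANY model disqualifies.
                feasible &= (cost_fn(s) == 0)
            rewards += r / len(ensemble)      # mean predicted return
        # Rank feasible candidates first, then refit the sampling Gaussian.
        scores = np.where(feasible, rewards, -np.inf)
        elites = cand[np.argsort(scores)[-n_elites:]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6     # avoid premature collapse
    return mu[0]  # execute only the first action, MPC-style
```

In an MPC loop, this planner would be re-invoked at every environment step with the latest observed state, so planning errors are corrected by replanning rather than by executing the full open-loop sequence.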