Despite the tremendous success of Reinforcement Learning (RL) algorithms in simulation environments, applying RL to real-world applications still faces many challenges. A major concern is safety, or in other words, constraint satisfaction. State-wise constraints are among the most common constraints in real-world applications and among the most challenging to handle in Safe RL. Enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. This paper provides a comprehensive review of existing approaches that address state-wise constraints in RL. Under the framework of the State-wise Constrained Markov Decision Process (SCMDP), we discuss the connections, differences, and trade-offs of existing approaches in terms of (i) safety guarantees and scalability, (ii) safety and reward performance, and (iii) safety after convergence versus safety during training. We also summarize the limitations of current methods and discuss potential future directions.
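For concreteness, a minimal sketch of the SCMDP formulation that such a review typically builds on is given below; the notation (cost functions $c_i$, thresholds $w_i$, discount $\gamma$) is assumed here for illustration rather than quoted from the paper. In contrast to the expected cumulative-cost constraint of a standard CMDP, the state-wise constraint must hold at every step of every trajectory:
\[
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
c_i(s_t, a_t, s_{t+1}) \le w_i, \;\; \forall t \ge 0, \; \forall i,
\]
where each $c_i$ is a per-step cost (e.g., distance into an unsafe region) and $w_i$ its allowed bound. The pointwise "for all $t$" quantifier is what distinguishes state-wise safety from the averaged, in-expectation constraints handled by classical constrained RL methods.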