In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. When RL agents are deployed in real-world environments, safety is of primary concern. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints in an effort to explore its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^{4}$), which extends the Explicit Explore or Exploit ($E^{3}$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, yielding targeted policies for policy improvement across known states, discovery of unknown states, and safe return to known states. $E^4$ robustly optimises these policies on the worst-case CMDP from a set of CMDP models consistent with the empirical observations of the deployment environment. Theoretical results show that $E^4$ finds a near-optimal constraint-satisfying policy in polynomial time whilst satisfying safety constraints throughout the learning process. We discuss robust-constrained offline optimisation algorithms, as well as how to incorporate uncertainty in the transition dynamics of unknown states based on empirical inference and prior knowledge.
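To make the three-way structure described above concrete, the following is a minimal, self-contained sketch of the decision logic: separate exploitation, exploration, and escape policies, each planned robustly against the worst-case model from a set of plausible CMDPs. All names here (`CMDPModelSet`, `plan_robust`, the objective strings) are illustrative assumptions for this sketch, not the paper's actual interface.

```python
# Illustrative sketch of the E^4 decision structure (assumed names, not the
# authors' implementation).  The robust-constrained planner is stubbed out so
# the example runs as-is.
from dataclasses import dataclass
from typing import Callable, List, Set

State = int
Policy = Callable[[State], int]  # maps a state to an action


@dataclass
class CMDPModelSet:
    """Placeholder for the set of CMDP models consistent with observations."""
    models: List[object]


def plan_robust(models: CMDPModelSet, known: Set[State], objective: str) -> Policy:
    """Stub for robust-constrained offline planning on the worst-case CMDP.
    In the paper this step solves a robust constrained optimisation problem;
    here it simply returns a trivial policy so the sketch is executable."""
    return lambda s: 0


def e4_choose_policy(state: State, known: Set[State], models: CMDPModelSet) -> Policy:
    if state in known:
        # Inside the known-state set: either exploit (improve return on the
        # known-state sub-CMDP) or explore (try to reach an unknown state),
        # both subject to the safety constraints on the worst-case model.
        exploit = plan_robust(models, known, objective="return")
        explore = plan_robust(models, known, objective="discover_unknown")
        return exploit  # the rule for choosing between the two is omitted here
    # Outside the known-state set: the escape policy returns safely to known states.
    return plan_robust(models, known, objective="escape")


if __name__ == "__main__":
    policy = e4_choose_policy(state=3, known={0, 1, 2}, models=CMDPModelSet(models=[]))
    print(policy(3))  # action chosen by the (stub) escape policy
```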