In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. When RL agents are deployed in real world environments, safety is of primary concern. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints in an effort to explore its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^{4}$), which extends the Explicit Explore or Exploit ($E^{3}$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, as well as safe return to known states. $E^4$ robustly optimises these policies on the worst-case CMDP from a set of CMDP models consistent with the empirical observations of the deployment environment. Theoretical results show that $E^4$ finds a near-optimal constraint-satisfying policy in polynomial time whilst satisfying safety constraints throughout the learning process. We then discuss $E^4$ as a practical algorithmic framework, including robust-constrained offline optimisation algorithms, the design of uncertainty sets for the transition dynamics of unknown states, and how to further leverage empirical observations and prior knowledge to relax some of the worst-case assumptions underlying the theory.