Standard reinforcement learning (RL) algorithms train agents to maximize given reward functions. However, many real-world applications of RL require agents to also satisfy certain constraints, which may, for example, be motivated by safety concerns. Constrained RL algorithms approach this problem by training agents to maximize given reward functions while respecting \textit{explicitly} defined constraints. In many cases, however, manually designing accurate constraints is a challenging task. In this work, given a reward function and a set of demonstrations from an expert that maximizes this reward function while respecting \textit{unknown} constraints, we propose a framework to learn the most likely constraints that the expert respects. We then train agents to maximize the given reward function subject to the learned constraints. Previous works in this direction have mainly been restricted to tabular settings or to specific types of constraints, or they assume knowledge of the transition dynamics of the environment. In contrast, we empirically show that our framework is able to learn arbitrary \textit{Markovian} constraints in high dimensions in a model-free setting.
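For context, a minimal sketch of the constrained objective referred to above, written in standard constrained-MDP notation (the constraint function $c$ and threshold $\beta$ are illustrative placeholders and not notation introduced in this work):
\begin{align}
\max_{\pi} \;& \mathbb{E}_{\pi}\!\left[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] \\
\text{s.t.} \;& \mathbb{E}_{\pi}\!\left[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le \beta,
\end{align}
where $r$ is the given reward function and $c$ encodes the constraints. In the setting considered here, $c$ is unknown and must first be inferred from the expert demonstrations before the agent is trained against it.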