Safety in goal-directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories, an approach that has demonstrated good performance primarily in short-horizon tasks (where the goal is not too far away). In this paper, we are specifically interested in solving temporally extended decision-making problems in the presence of complex safety constraints, for example: (1) robots that have to clean different areas of a house while avoiding slippery and unsafe areas (e.g., stairs) and retaining enough charge to return to a charging dock; (2) autonomous electric vehicles that have to reach a faraway destination while optimizing charging stops along the way. Our key contribution is a (safety) Constrained Planning with Reinforcement Learning (CoP-RL) mechanism that combines a high-level constrained planning agent (which computes a reward-maximizing path from a given start to a distant goal state while satisfying cost constraints) with a low-level goal-conditioned RL agent (which estimates cost and reward values for moving between nearby states). A major advantage of CoP-RL is that it can handle constraints on the cost value distribution (e.g., on Conditional Value at Risk, CVaR, as well as on the expected value). We perform extensive experiments with different types of safety constraints to demonstrate the utility of our approach over leading approaches in constrained and hierarchical RL.
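To make the constraint family concrete, here is a hedged formal sketch (our notation, not necessarily the paper's): let $C(\pi)$ denote the random cumulative cost incurred by a high-level plan $\pi$, $R(\pi)$ the cumulative reward, $d$ a cost budget, and $\alpha \in [0,1)$ a risk level. The CVaR-constrained objective then reads

\[
\max_{\pi} \; \mathbb{E}[R(\pi)]
\quad \text{s.t.} \quad
\mathrm{CVaR}_{\alpha}(C(\pi)) \;=\; \mathbb{E}\big[\, C(\pi) \mid C(\pi) \ge \mathrm{VaR}_{\alpha}(C(\pi)) \,\big] \;\le\; d,
\]

where $\mathrm{VaR}_{\alpha}$ is the $\alpha$-quantile of the cost distribution; with this convention, $\alpha = 0$ recovers the expected-value constraint $\mathbb{E}[C(\pi)] \le d$.

The hierarchy itself can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation; all identifiers (constrained_plan, neighbors, edge_reward, edge_cost) are hypothetical stand-ins. The high-level planner searches over waypoints, using the low-level goal-conditioned agent's learned reward and cost estimates for nearby-state transitions, and prunes any plan whose accumulated cost estimate violates the budget.

    import heapq
    from typing import Callable, Dict, List, Tuple

    def constrained_plan(
        start: str,
        goal: str,
        neighbors: Dict[str, List[str]],
        edge_reward: Callable[[str, str], float],  # low-level reward estimate
        edge_cost: Callable[[str, str], float],    # low-level cost estimate
        cost_budget: float,
    ) -> Tuple[List[str], float, float]:
        # Best-first enumeration over waypoint paths: maximize total estimated
        # reward while pruning extensions whose total estimated cost would
        # exceed the budget (an expected-cost constraint; a distributional
        # cost critic would allow pruning on CVaR instead).
        frontier = [(0.0, 0.0, start, [start])]  # (negated reward, cost, node, path)
        best_path, best_reward, best_cost = [], float("-inf"), 0.0
        while frontier:
            neg_r, c, node, path = heapq.heappop(frontier)
            if node == goal:
                if -neg_r > best_reward:
                    best_path, best_reward, best_cost = path, -neg_r, c
                continue
            for nxt in neighbors.get(node, []):
                if nxt in path:          # simple cycle avoidance
                    continue
                nc = c + edge_cost(node, nxt)
                if nc > cost_budget:     # constraint pruning
                    continue
                heapq.heappush(
                    frontier,
                    (neg_r - edge_reward(node, nxt), nc, nxt, path + [nxt]),
                )
        return best_path, best_reward, best_cost

In practice, the edge_reward and edge_cost callables would wrap the low-level goal-conditioned agent's learned value estimates between nearby states; replacing the scalar cost estimate with a distributional critic is what would let the pruning test constrain CVaR rather than the expected cost.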