We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning (RL) setting. We consider the scenario where: (i) we have a dataset collected under a known baseline policy, and (ii) multiple reward signals are received from the environment, each inducing an objective to optimize. We present an SPI formulation for this RL setting that takes into account the preferences of the algorithm's user for handling the trade-offs between the different reward signals, while ensuring that the new policy performs at least as well as the baseline policy along each individual objective. We build on traditional SPI algorithms and propose a novel method based on Safe Policy Improvement with Baseline Bootstrapping (SPIBB; Laroche et al., 2019) that provides high-probability guarantees on the performance of the agent in the true environment. We show the effectiveness of our method on a synthetic grid-world safety task as well as in a real-world critical care context, where we learn a policy for the administration of IV fluids and vasopressors to treat sepsis.
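As a rough sketch of the constrained formulation described above (the notation and the exact form of the guarantee are our assumptions, not necessarily those used in the paper): let $\rho_i(\pi)$ denote the expected return of policy $\pi$ under the $i$-th reward signal, $\pi_b$ the baseline policy, $w_i \ge 0$ user-specified preference weights, and $1-\delta$ the desired confidence level. The multi-objective SPI problem can then be written as

\begin{align*}
\max_{\pi} \;& \sum_{i=1}^{n} w_i \, \rho_i(\pi) \\
\text{s.t.}\;& \Pr\!\big[\rho_i(\pi) \ge \rho_i(\pi_b) - \zeta_i\big] \ge 1 - \delta, \qquad i = 1, \dots, n,
\end{align*}

where $\zeta_i \ge 0$ is an admissible performance-loss tolerance for the $i$-th objective (set to zero when improvement over the baseline is required along every objective), and the probability is taken over the randomness of the dataset collected under $\pi_b$.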