We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset. This problem setting is appealing in many real-world scenarios where direct interaction with the environment is costly or risky, and where the resulting policy should comply with safety constraints. However, it is challenging to compute a policy that is guaranteed to satisfy the cost constraints in the offline setting, since off-policy evaluation inherently incurs estimation error. In this paper, we present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions. Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the return-optimal policy while constraining an upper bound on the cost, with the goal of yielding a cost-conservative policy for actual constraint satisfaction. Experimental results show that COptiDICE attains better policies in terms of both constraint satisfaction and return maximization, outperforming baseline algorithms.
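To make the formulation sketched above concrete, the constrained problem in the space of stationary distributions can be written schematically as follows. The symbols ($d$ for a stationary state-action distribution, $d^D$ for the dataset distribution, $D_f$ for an $f$-divergence, $R$ and $C_k$ for the reward and cost functions, $\hat{c}_k$ for cost thresholds, $p_0$, $P$, $\gamma$, and $\alpha$ for the initial state distribution, transition kernel, discount factor, and regularization weight) are used here purely for illustration and may differ from the notation adopted in the body of the paper:

\begin{align*}
\max_{d \ge 0} \quad & \mathbb{E}_{(s,a)\sim d}\big[R(s,a)\big] \;-\; \alpha\, D_f\big(d \,\|\, d^D\big) \\
\text{s.t.} \quad & \mathbb{E}_{(s,a)\sim d}\big[C_k(s,a)\big] \;\le\; \hat{c}_k \quad \text{for each cost } k, \\
& \sum_{a} d(s,a) \;=\; (1-\gamma)\, p_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s.
\end{align*}

In this sketch, the ratio $d(s,a)/d^D(s,a)$ between the optimizer and the dataset distribution is the stationary distribution correction referred to in the abstract, and the cost-conservatism comes from constraining an upper bound on the estimated cost rather than the point estimate itself.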