Safe reinforcement learning (RL) trains a constraint-satisfying policy by interacting with the environment. We aim to tackle a more challenging problem: learning a safe policy from an offline dataset. We study the offline safe RL problem from a novel multi-objective optimization perspective and propose the $\epsilon$-reducible concept to characterize problem difficulty. The inherent trade-offs between safety and task performance inspire us to propose the constrained decision transformer (CDT) approach, which can dynamically adjust the trade-offs during deployment. Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. CDT outperforms its variants and strong offline safe RL baselines by a large margin with the same hyperparameters across all tasks, while retaining zero-shot adaptation capability to different constraint thresholds, making our approach more suitable for real-world RL under constraints.