This paper presents a distributionally robust Q-Learning algorithm (DrQ) which leverages Wasserstein ambiguity sets to provide probabilistic out-of-sample safety guarantees during online learning. First, we follow past work by separating the constraint functions from the principal objective to create a hierarchy of machines which estimate the feasible state-action space within the constrained Markov decision process (CMDP). DrQ works within this framework by augmenting constraint costs with tightening offset variables obtained through Wasserstein distributionally robust optimization (DRO). These offset variables correspond to worst-case distributions of modeling error characterized by the TD-errors of the constraint Q-functions. This procedure allows us to safely approach the nominal constraint boundaries with strong probabilistic safety guarantees. Using a case study of safe lithium-ion battery fast charging, we demonstrate dramatic improvements in safety and performance relative to conventional methods.
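To make the tightening step concrete, the following is a minimal sketch and not the authors' implementation. It assumes the probabilistic constraint is handled through a conditional value-at-risk (CVaR) surrogate over a type-1 Wasserstein ball centered on the empirical distribution of constraint TD-errors; the abstract does not specify this formulation. The function names, the tail mass `alpha`, and the ball radius `epsilon` are all hypothetical. Under these assumptions, the empirical CVaR plus `epsilon / alpha` is a conservative upper bound on the worst-case CVaR, because the CVaR integrand is `1 / alpha`-Lipschitz in the error.

```python
import numpy as np


def empirical_cvar(samples, alpha):
    """Empirical CVaR at tail mass alpha (0 < alpha <= 1).

    Uses the Rockafellar-Uryasev form min_t { t + E[(x - t)^+] / alpha }.
    The objective is piecewise linear and convex with kinks at the samples,
    so checking every sample as a candidate t gives the exact minimum.
    """
    xs = np.asarray(samples, dtype=float)
    # Row i holds (x_j - t_i)^+ for the candidate threshold t_i = xs[i].
    excess = np.maximum(xs[None, :] - xs[:, None], 0.0)
    objective = xs + excess.mean(axis=1) / alpha
    return float(objective.min())


def tightening_offset(td_errors, alpha, epsilon):
    """Conservative offset (assumed form): empirical CVaR of the TD-errors
    plus the Wasserstein penalty epsilon / alpha."""
    return empirical_cvar(td_errors, alpha) + epsilon / alpha


def is_action_allowed(q_constraint, limit, offset):
    """Accept an action only if the tightened constraint estimate holds."""
    return q_constraint + offset <= limit


# Illustrative numbers only; these are not data from the paper.
td_errors = [1.0, 2.0, 3.0, 4.0]
offset = tightening_offset(td_errors, alpha=0.5, epsilon=0.1)
print(offset)
```

In this sketch a larger `epsilon` or a smaller `alpha` increases the offset, which shrinks the set of actions accepted by `is_action_allowed`. That is the sense in which the offset tightens the nominal constraint boundary.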