This paper presents a distributionally robust Q-learning algorithm (DrQ) that uses Wasserstein ambiguity sets to provide idealistic probabilistic out-of-sample safety guarantees during online learning. First, following past work, we separate the constraint functions from the principal objective to create a hierarchy of machines that estimate the feasible state-action space within the constrained Markov decision process (CMDP). DrQ works within this framework by augmenting the constraint costs with tightening offset variables obtained through Wasserstein distributionally robust optimization (DRO). These offset variables correspond to worst-case distributions of the modeling error, which is characterized by the TD-errors of the constraint Q-functions. This procedure allows the agent to approach the nominal constraint boundaries safely. Using a case study of lithium-ion battery fast charging, we explore how these idealistic safety guarantees translate into generally improved safety relative to conventional methods.
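The constraint-tightening idea can be illustrated with a minimal sketch. It is not the paper's exact formulation: it assumes a type-1 Wasserstein ball centered on the empirical distribution of scalar TD-errors, and it replaces the chance constraint with a CVaR surrogate, for which `empirical CVaR + radius / alpha` is a conservative upper bound on the worst case over the ball. The function names, the tail level `alpha`, the ball `radius`, and the rule `Q_constraint + offset <= limit` are all hypothetical choices made for illustration.

```python
import numpy as np


def empirical_cvar(samples, alpha):
    """Empirical CVaR at tail level alpha via the Rockafellar-Uryasev form.

    Minimizes t + mean(max(x - t, 0)) / alpha over t. The objective is
    piecewise linear and convex, so a minimizer lies at a sample point.
    """
    x = np.asarray(samples, dtype=float)
    values = [t + np.mean(np.maximum(x - t, 0.0)) / alpha for t in x]
    return float(min(values))


def wasserstein_offset(td_errors, alpha=0.1, radius=0.05):
    """Tightening offset from TD-errors (illustrative assumption).

    Upper-bounds the worst-case CVaR of the modeling error over a type-1
    Wasserstein ball of the given radius around the empirical distribution.
    """
    return empirical_cvar(td_errors, alpha) + radius / alpha


def is_action_safe(q_constraint, offset, limit):
    """Hypothetical safety check: tightened constraint value must stay below the limit."""
    return q_constraint + offset <= limit


if __name__ == "__main__":
    # Synthetic TD-errors standing in for those of one constraint Q-function.
    rng = np.random.default_rng(0)
    td_errors = rng.normal(loc=0.0, scale=0.2, size=500)
    offset = wasserstein_offset(td_errors, alpha=0.1, radius=0.02)
    print("offset:", offset)
    print("safe:", is_action_safe(q_constraint=3.5, offset=offset, limit=4.2))
```

A larger `radius` or a smaller `alpha` gives a larger offset, so the agent stays further from the nominal constraint boundary. As more TD-error samples are collected, the radius can be reduced and the offset relaxed.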