In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion, which minimizes the probability of a catastrophic failure. Unfortunately, such policies are typically overly conservative as the percentile criterion is non-convex, difficult to optimize, and ignores the mean performance. To overcome these shortcomings, we study the soft-robust criterion, which uses risk measures to balance the mean and percentile criteria better. In this paper, we establish the soft-robust criterion's fundamental properties, show that it is NP-hard to optimize, and propose and analyze two algorithms to optimize it approximately. Our theoretical analyses and empirical evaluations demonstrate that our algorithms compute much less conservative solutions than the existing approximate methods for optimizing the percentile criterion.
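For concreteness, a minimal sketch of the two objectives, assuming the standard Bayesian-robust formulation; the posterior $f$ over the uncertain model $P$, the return $\rho(\pi, P)$, the confidence level $\delta$, the mixing weight $\alpha$, and the use of CVaR as the risk measure are illustrative assumptions rather than the paper's exact notation:
\[
\text{(percentile)} \qquad \max_{\pi,\, y}\; y \quad \text{s.t.}\quad \Pr_{P \sim f}\!\bigl[\rho(\pi, P) \ge y\bigr] \ge 1 - \delta,
\]
\[
\text{(soft-robust)} \qquad \max_{\pi}\; \alpha\, \mathbb{E}_{P \sim f}\!\bigl[\rho(\pi, P)\bigr] \;+\; (1-\alpha)\, \operatorname{CVaR}_{\delta}^{P \sim f}\!\bigl[\rho(\pi, P)\bigr].
\]
Here CVaR stands in for a generic risk measure; replacing the hard quantile constraint with a weighted mean-CVaR objective is what trades strict worst-case guarantees for better average performance.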