We address the problem of computing reliable policies in reinforcement learning problems with limited data. In particular, we compute policies that achieve good returns with high confidence when deployed. This objective, known as the \emph{percentile criterion}, can be optimized using Robust MDPs~(RMDPs). RMDPs generalize MDPs to allow for uncertain transition probabilities chosen adversarially from given ambiguity sets. We show that the RMDP solution's sub-optimality depends on the spans of the ambiguity sets along the value function. We then propose new algorithms that minimize the span of ambiguity sets defined by weighted $L_1$ and $L_\infty$ norms. Our primary focus is on Bayesian guarantees, but we also describe how our methods apply to frequentist guarantees and derive new concentration inequalities for weighted $L_1$ and $L_\infty$ norms. Experimental results indicate that our optimized ambiguity sets improve significantly on prior construction methods.
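The core computational step behind such RMDP objectives is the inner adversarial problem: finding the worst-case transition distribution inside an ambiguity set. Below is a minimal sketch for an *unweighted* $L_1$ ball around a nominal distribution, using the standard greedy solution; the function name and interface are illustrative, not from the paper, and the weighted case requires a generalization of this routine.

```python
import numpy as np

def worst_case_l1(p_bar, v, psi):
    """Adversarial transition distribution within an (unweighted) L1 ball.

    Solves  min_p  p @ v   s.t.  ||p - p_bar||_1 <= psi,  p a distribution,
    via the standard O(n log n) greedy: move probability mass from the
    highest-value successor states onto the single lowest-value state.
    """
    p = np.asarray(p_bar, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    i_min = int(np.argmin(v))
    # Each unit of mass relocated costs 2 in L1 distance, so the
    # adversary can move at most psi / 2 (capped by available headroom).
    eps = min(psi / 2.0, 1.0 - p[i_min])
    p[i_min] += eps
    # Drain the moved mass from states in decreasing order of value.
    for j in np.argsort(v)[::-1]:
        if j == i_min:
            continue
        take = min(p[j], eps)
        p[j] -= take
        eps -= take
        if eps <= 1e-12:
            break
    return p
```

For example, with a uniform nominal distribution over four states, values `[1, 2, 3, 4]`, and budget `psi = 0.5`, the adversary shifts mass from the best state to the worst, lowering the expected value from 2.5 to 1.75. The span of the set along `v` (the gap between worst-case and best-case expectations) is exactly the quantity the paper's algorithms aim to shrink.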