Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. On the other hand, regularized MDPs show more stability in policy learning without impairing time complexity. Yet, they generally do not encompass uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that regularized MDPs are a particular instance of robust MDPs with uncertain reward. We thus establish that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs. We further extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We then generalize regularized MDPs to twice regularized MDPs ($\text{R}^2$ MDPs), i.e., MDPs with $\textit{both}$ value and policy regularization. The corresponding Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees, thus reducing robustness to regularization. We numerically show this two-fold advantage on tabular and physical domains, highlighting the fact that $\text{R}^2$ preserves its efficacy in continuous environments.
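To make the claimed equivalence concrete, below is a minimal sketch of policy regularization on a small tabular MDP: an entropy-regularized (soft) value iteration whose closed-form log-sum-exp backup has the same per-iteration cost as standard value iteration, illustrating why reward-robust planning can be as cheap as regularized planning. The random MDP, the entropy regularizer, and all names (e.g., `tau`) are illustrative assumptions, not the paper's exact R$^2$ Bellman operator.

```python
import numpy as np

# Minimal sketch: entropy-regularized value iteration on a random tabular MDP.
# The closed-form log-sum-exp backup shows that policy regularization keeps
# the per-iteration cost of dynamic programming unchanged, in contrast to
# solving an inner robust optimization problem at every step.
# The random MDP and the regularizer choice are illustrative assumptions.

rng = np.random.default_rng(0)
n_states, n_actions, gamma, tau = 5, 3, 0.9, 0.1  # tau: regularization weight

# Random transition kernel P[s, a, s'] and reward r[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

v = np.zeros(n_states)
for _ in range(500):
    q = r + gamma * P @ v                      # Q-values, shape (S, A)
    m = q.max(axis=1)                          # stabilized log-sum-exp backup
    v_new = m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

# Softmax policy induced by the regularized Bellman operator
pi = np.exp((q - v[:, None]) / tau)
pi /= pi.sum(axis=1, keepdims=True)
print(np.round(v, 3))
```

Each sweep costs the same O(S^2 A) as an unregularized backup; only the max over actions is replaced by a softmax, which is the kind of computational parity the abstract refers to.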