In robust Markov decision processes (MDPs), the uncertainty in the transition kernel is addressed by finding a policy that optimizes the worst-case performance over an uncertainty set of MDPs. While much of the literature has focused on discounted MDPs, robust average-reward MDPs remain largely unexplored. In this paper, we focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set. We first take an approach that approximates average-reward MDPs using discounted MDPs. We prove that the robust discounted value function converges to the robust average reward as the discount factor $\gamma$ goes to $1$, and moreover, that when $\gamma$ is large, any optimal policy of the robust discounted MDP is also an optimal policy of the robust average-reward MDP. We further design a robust dynamic programming approach and theoretically characterize its convergence to the optimum. Then, we investigate robust average-reward MDPs directly, without using discounted MDPs as an intermediate step. We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value iteration algorithm that provably finds its solution, or equivalently, the optimal robust policy.
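To make the robust relative value iteration concrete, the sketch below illustrates the idea for a finite state and action space with a finite, $(s,a)$-rectangular uncertainty set, under standard unichain/aperiodicity conditions. The function name, the finite representation of the uncertainty set, and the reference-state normalization are illustrative assumptions for this sketch, not the paper's general formulation.

```python
import numpy as np

def robust_relative_value_iteration(r, P_sets, ref_state=0, tol=1e-8, max_iter=10_000):
    """Illustrative sketch of robust relative value iteration.

    r        : reward array of shape (S, A)
    P_sets   : P_sets[s][a] is a list of candidate transition vectors over S
               (a finite, (s, a)-rectangular uncertainty set; an assumption
               made for this sketch only)
    Returns an estimate of the optimal robust average reward (gain) and a
    relative value function.
    """
    S, A = r.shape
    w = np.zeros(S)
    gain = 0.0
    for _ in range(max_iter):
        w_new = np.empty(S)
        for s in range(S):
            # Robust Bellman update: worst case over the uncertainty set,
            # best case over actions.
            q = [r[s, a] + min(np.dot(p, w) for p in P_sets[s][a]) for a in range(A)]
            w_new[s] = max(q)
        gain = w_new[ref_state]   # offset at the reference state estimates the gain
        w_new -= gain             # subtract it to keep the iterates bounded (relative VI)
        if np.max(np.abs(w_new - w)) < tol:
            break
        w = w_new
    return gain, w
```

Here the inner minimum plays the role of the support function of the uncertainty set; for general compact uncertainty sets it would be replaced by the corresponding worst-case expectation rather than a minimum over a finite list.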