Robust Markov decision processes (MDPs) have attracted growing attention as a means of learning policies that are less sensitive to changes in the environment, and an increasing number of works analyze the sample efficiency of robust MDPs. However, most of these works study robust MDPs in a model-based regime, where the transition probabilities must be estimated and stored, requiring $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ memory. A common way to solve robust MDPs is to formulate them as a distributionally robust optimization (DRO) problem. Solving a DRO problem is non-trivial, however, so prior works typically assume a strong oracle that readily returns the optimal solution of the DRO problem. To remove the need for such an oracle, we first transform the original robust MDP into an alternative form that admits stochastic gradient methods, and we prove that this alternative form still preserves robustness. With this new formulation, we devise a sample-efficient algorithm that solves robust MDPs in a model-free regime, requiring only $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$ memory and no oracle. Finally, we validate our theoretical findings via numerical experiments and demonstrate the efficiency of solving the alternative form of robust MDPs.
翻译:(mDPs) 学习一种对环境变化不太敏感的稳健政策越来越受到更多关注。 分析稳健的MDPs抽样效率的工作越来越多。 然而, 多数工作在基于模型的制度下研究稳健的MDPs, 需要估算过渡概率, 并需要在记忆中存储$\mathcal{O}( mathcal{S ⁇ 2 ⁇ mathcal{A ⁇ }) 。 解决稳健的MDPs的一个共同方式是将它们发展成一个分布式强力优化( DRO)问题。 然而, 解决DRO问题不是三重力的, 所以先前的工作通常会假设一个强力的神器, 以便很容易地获得对DRO问题的最佳解决方案。 为了消除对一个神器的需要, 我们首先将原来的稳健健的MDPs 变成一种替代形式, 因为替代形式允许我们使用随机的梯变梯度梯度方法来解决稳健的MDPsurity( DRO) 。 此外, 我们证明另一种形式仍然保留稳健的功能。 。 但是, 我们用这个新的配方, 我们设计一个节制的节制的节制的节制算算算算算算算算算法, 来解决稳健健健健健健健健健健的MDPs mDPs malmacalalalal___ lax lax froalalalalal fors froalalal proal ps fal proal pal proal