Stochastic and soft optimal policies resulting from entropy-regularized Markov decision processes (ER-MDP) are desirable for exploration and imitation learning applications. Motivated by the fact that such policies are sensitive to the state transition probabilities, and that estimates of these probabilities may be inaccurate, we study a robust version of the ER-MDP model, in which the stochastic optimal policies are required to be robust with respect to ambiguity in the underlying transition probabilities. Our work lies at the crossroads of two important schemes in reinforcement learning (RL), namely robust MDP and entropy-regularized MDP. We show that essential properties that hold for the non-robust ER-MDP and robust unregularized MDP models also hold in our setting, making the robust ER-MDP problem tractable. We show how our framework and results can be integrated into different algorithmic schemes, including value iteration and (modified) policy iteration, leading to new robust RL and inverse RL algorithms that handle uncertainty. Analyses of computational complexity and error propagation under conventional uncertainty settings are also provided.
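For intuition, a minimal sketch of the kind of backup such a framework combines (the notation here is ours, not taken from the abstract): assuming an $(s,a)$-rectangular ambiguity set $\mathcal{P}_{s,a}$ of transition kernels, an entropy weight $\tau>0$, and Shannon entropy $\mathcal{H}$, the robust entropy-regularized Bellman update can be written as
$$
V(s) \;=\; \max_{\pi(\cdot\mid s)\in\Delta(\mathcal{A})}\;\Big[\sum_{a}\pi(a\mid s)\,\min_{p\in\mathcal{P}_{s,a}}\Big(r(s,a)+\gamma\sum_{s'}p(s'\mid s,a)\,V(s')\Big)\;+\;\tau\,\mathcal{H}\big(\pi(\cdot\mid s)\big)\Big],
$$
whose outer maximization admits the familiar log-sum-exp (soft-max) closed form $V(s)=\tau\log\sum_{a}\exp\!\big(Q^{\mathrm{rob}}(s,a)/\tau\big)$, with worst-case action values $Q^{\mathrm{rob}}(s,a)=r(s,a)+\gamma\min_{p\in\mathcal{P}_{s,a}}\sum_{s'}p(s'\mid s,a)\,V(s')$.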