Learning Markov decision processes (MDPs) in the presence of an adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial rewards and full-information feedback, where the unknown transition probability function is a linear function of a given feature mapping, and the reward function can change arbitrarily from episode to episode. We propose an optimistic policy optimization algorithm, POWERS, and show that it achieves $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of each episode, $T$ is the number of interactions with the MDP, and $d$ is the dimension of the feature mapping. Furthermore, we prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$, showing that our regret is optimal up to logarithmic factors. Our key technical contributions are two-fold: (1) a new value function estimator based on importance weighting, and (2) a tighter confidence set for the transition kernel. Together, they lead to the nearly minimax optimal regret.
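For concreteness, the regret being bounded can be written in the standard form for episodic adversarial MDPs; the notation below ($K$ episodes with $T = KH$, initial states $s_{k,1}$, the learner's policies $\pi_k$, and value functions $V^{\pi}_{k,1}$ under the episode-$k$ reward) is a sketch of the usual definition and is not taken verbatim from this abstract:
\[
\mathrm{Regret}(K) \;=\; \max_{\pi} \sum_{k=1}^{K} \Bigl( V^{\pi}_{k,1}(s_{k,1}) - V^{\pi_k}_{k,1}(s_{k,1}) \Bigr),
\]
i.e., the cumulative gap to the best fixed policy in hindsight over all $K$ episodes.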