Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Policy optimization is a widely used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov decision processes (MDPs) that bypass the challenge of global exploration. To eliminate the need for such assumptions, in this work we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state of the art. Specifically, in the tabular case, we obtain $\widetilde{\mathcal{O}}(\sqrt{T})$ regret, where $T$ is the number of episodes, improving the $\widetilde{\mathcal{O}}({T}^{2/3})$ regret bound of Shani et al. (2020). When the number of states is infinite, under the assumption that the state-action values are linear in some low-dimensional features, we obtain $\widetilde{\mathcal{O}}({T}^{2/3})$ regret with the help of a simulator, matching the result of Neu and Olkhovskaya (2020) while, importantly, removing the need for the exploratory policy that their algorithm requires. When a simulator is unavailable, we further consider a linear MDP setting and obtain $\widetilde{\mathcal{O}}({T}^{14/15})$ regret, which is the first result for linear MDPs with adversarial losses and bandit feedback.
