Learning Markov decision processes (MDPs) in an adversarial environment is a challenging problem. The problem becomes even more challenging with function approximation, since the underlying structure of the loss function and transition kernel is especially hard to estimate in a varying environment. In fact, the state-of-the-art result for linear adversarial MDPs achieves a regret of $\tilde{O}(K^{6/7})$ (where $K$ denotes the number of episodes), which leaves significant room for improvement. In this paper, we investigate the problem from a new viewpoint, which reduces linear MDPs to linear optimization by carefully constructing the feature maps of the bandit arms of the linear optimization problem. Under an exploratory assumption, this new technique yields an improved bound of $\tilde{O}(K^{4/5})$ for linear adversarial MDPs without access to a transition simulator. The new viewpoint may be of independent interest for solving other MDP problems that possess a linear structure.