We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order $\tilde{\mathcal O}(K^{2/3})$ (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). Moreover, we extend the first algorithm to simulator-free linear MDPs, achieving $\tilde{\mathcal O}(K^{8/9})$ regret and greatly improving over the best existing bound of $\tilde{\mathcal O}(K^{14/15})$. This algorithm relies on a better alternative to the Matrix Geometric Resampling procedure by Neu & Olkhovskaya (2020), which could again be of independent interest.
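To make the FTRL-with-log-barrier component concrete, the following is a minimal sketch of one generic FTRL update over the probability simplex with a log-barrier regularizer; it illustrates only the standard update (not the paper's refined analysis or its loss estimators), and the function name `ftrl_log_barrier_step`, the learning rate `eta`, and the toy losses are our own illustrative assumptions. Note that the update remains well defined even when the cumulative loss estimates are negative, which is the property the abstract highlights.

```python
import numpy as np

def ftrl_log_barrier_step(cum_loss, eta, tol=1e-12, iters=200):
    """One FTRL step with the log-barrier regularizer over the simplex.

    Solves  p = argmin_p  <cum_loss, p> + (1/eta) * sum_a(-log p_a)
    over the probability simplex.  The KKT conditions give
    p_a = 1 / (eta * cum_loss_a + mu) for a normalization constant mu,
    found here by binary search.  Negative entries of cum_loss are
    handled naturally by the log-barrier choice.
    """
    L = np.asarray(cum_loss, dtype=float)
    n = L.size
    lo = -eta * L.min() + 1e-12   # mu must keep every denominator positive
    hi = -eta * L.min() + n       # large enough that the probabilities sum to <= 1
    mu = hi
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        total = np.sum(1.0 / (eta * L + mu))
        if abs(total - 1.0) < tol:
            break
        if total > 1.0:
            lo = mu               # probabilities too large -> increase mu
        else:
            hi = mu
    p = 1.0 / (eta * L + mu)
    return p / p.sum()            # renormalize away residual numerical error

# Toy usage: cumulative loss estimates may be arbitrarily negative.
rng = np.random.default_rng(0)
cum_loss = np.zeros(5)
for t in range(10):
    p = ftrl_log_barrier_step(cum_loss, eta=0.1)
    cum_loss += rng.normal(scale=5.0, size=5)  # stand-in for (possibly negative) loss estimates
print(p)
```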