We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions change over time and are adversarially chosen, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback). While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about Follow-the-Perturbed-Leader (FTPL) methods, which are usually computationally more efficient and easier to implement, since they only require solving an offline planning problem. Motivated by this, we take a closer look at FTPL for learning AMDPs, starting from the standard episodic finite-horizon setting. We find some unique and intriguing difficulties in the analysis and propose a workaround to eventually show that FTPL is also able to achieve near-optimal regret bounds in this case. More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using FTPL, we also develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions. Our algorithm is efficient assuming access to an offline planning oracle, while even for the easier full-information setting, the only existing algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.