We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes, with the goal of achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the losses are adversarial and simultaneously $\mathcal{O}(\text{polylog}(T))$ regret when the losses are (almost) stochastic. Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known, and leaves the case of unknown transition as a major open question. In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader ($\text{FTRL}$) framework together with a set of new techniques. Specifically, we first propose a loss-shifting trick in the $\text{FTRL}$ analysis, which greatly simplifies the approach of [Jin and Luo, 2020] and already improves their results for the known transition case. Then, we extend this idea to the unknown transition case and develop a novel analysis which upper bounds the transition estimation error by (a fraction of) the regret itself in the stochastic setting, a key property to ensure $\mathcal{O}(\text{polylog}(T))$ regret.
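As a schematic illustration of the loss-shifting trick (the notation below is ours and only sketches the idea): in a loop-free episodic MDP with initial state $s_0$, fixed transition $P$, and occupancy-measure set $\Omega(P)$, FTRL plays
\[
q_t = \operatorname*{argmin}_{q \in \Omega(P)} \Big\langle q, \sum_{\tau < t} \widehat{\ell}_\tau \Big\rangle + \phi_t(q),
\]
where $\widehat{\ell}_\tau$ is a loss estimator and $\phi_t$ a regularizer. The output $q_t$ is unchanged if each $\widehat{\ell}_\tau$ is replaced by $\widehat{\ell}_\tau + g_\tau$ for any shifting function $g_\tau$ whose inner product $\langle q, g_\tau \rangle$ is the same constant for all $q \in \Omega(P)$. For instance, taking $g_\tau(s,a) = Q_\tau(s,a) - V_\tau(s) - \widehat{\ell}_\tau(s,a)$, with $Q_\tau$ and $V_\tau$ the state-action and state value functions of some policy under $\widehat{\ell}_\tau$ and $P$, the flow constraints defining $\Omega(P)$ give $\langle q, g_\tau \rangle = -V_\tau(s_0)$ for every $q \in \Omega(P)$; the algorithm is therefore untouched, while the shifted losses can make the stability term in the FTRL analysis substantially smaller.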