Despite rapid progress in theoretical reinforcement learning (RL) over the last few years, most of the known guarantees are worst-case in nature, failing to take advantage of structure that may be known a priori about the RL problem at hand. In this paper we address the question of whether worst-case lower bounds for regret in online learning of Markov decision processes (MDPs) can be circumvented when information about the MDP, in the form of predictions about its optimal $Q$-value function, is given to the algorithm. We show that when the predictions about the optimal $Q$-value function satisfy a reasonably weak condition we call distillation, we can improve regret bounds by replacing the set of all state-action pairs in the bound with the set of state-action pairs on which the predictions are grossly inaccurate. This improvement holds for both uniform regret bounds and gap-based ones. Further, we obtain this guarantee with an algorithm that still achieves sublinear regret when given arbitrary predictions (i.e., even those which are not a distillation). Our work extends a recent line of work on algorithms with predictions, which has typically focused on simple online problems such as caching and scheduling, to the more complex and general problem of reinforcement learning.