Building on the framework introduced by Xu and Raginsky [1] for supervised learning problems, we study the best achievable performance of model-based Bayesian reinforcement learning problems. To this end, we define the minimum Bayesian regret (MBR) as the difference between the maximum expected cumulative reward obtainable by knowing the environment and its dynamics and that obtainable by learning from the collected data. We specialize this definition to reinforcement learning problems modeled as Markov decision processes (MDPs) whose kernel parameters are unknown to the agent and whose uncertainty is expressed by a prior distribution. A method for deriving upper bounds on the MBR is presented, and specific bounds based on the relative entropy and the Wasserstein distance are given. We then focus on two particular cases of MDPs: the multi-armed bandit problem (MAB) and the problem of online optimization with partial feedback. For the latter problem, we show that our bounds can recover from below the current information-theoretic bounds by Russo and Van Roy [2].
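As a hedged sketch of this definition (the notation below, an environment parameter $W$ drawn from the prior, a policy $\pi$, a horizon $T$, and a cumulative reward $R_T(\pi, W)$, is illustrative and not fixed by this abstract), the MBR can be written as
$$
\mathrm{MBR}_T \;=\; \mathbb{E}_{W}\!\Big[\sup_{\pi}\,\mathbb{E}\big[R_T(\pi, W)\,\big|\,W\big]\Big] \;-\; \sup_{\pi}\,\mathbb{E}_{W}\big[R_T(\pi, W)\big],
$$
i.e., the gap between the best expected cumulative reward of an agent that knows the realized environment $W$ and that of an agent whose policy may depend only on the collected data.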