Model-based Reinforcement Learning (MBRL) has been widely adopted due to its sample efficiency. However, existing worst-case regret analyses typically require optimistic planning, which is not realistic in general. In contrast, motivated by the theory, empirical studies utilize ensembles of models, which achieve state-of-the-art performance on various testing environments. Such a deviation between theory and practice leads us to ask whether randomized model ensembles guarantee optimism, and hence the optimal worst-case regret. This paper partially answers this question from the perspective of reward randomization, a scarcely explored direction of exploration in MBRL. We show that under the kernelized nonlinear regulator (KNR) model, reward randomization guarantees partial optimism, which further yields a near-optimal worst-case regret in terms of the number of interactions. We further extend our theory to generalized function approximation and identify conditions under which reward randomization attains provably efficient exploration. Correspondingly, we propose concrete examples of efficient reward randomization. To the best of our knowledge, our analysis establishes the first worst-case regret analysis of randomized MBRL with function approximation.