In the Bayesian reinforcement learning (RL) setting, a prior distribution over the unknown problem parameters -- the rewards and transitions -- is assumed, and a policy that optimizes the (posterior) expected return is sought. A common approximation, which has recently been popularized as meta-RL, is to train the agent on a sample of $N$ problem instances from the prior, with the hope that, for large enough $N$, good generalization to an unseen test instance will be obtained. In this work, we study generalization in Bayesian RL under the probably approximately correct (PAC) framework, using the method of algorithmic stability. Our main contribution is showing that by adding regularization, the optimal policy becomes stable in an appropriate sense. Most stability results in the literature build on strong convexity of the regularized loss -- an approach that is not suitable for RL, as Markov decision processes (MDPs) are not convex. Instead, building on recent results on fast convergence rates for mirror descent in regularized MDPs, we show that regularized MDPs satisfy a certain quadratic growth criterion, which is sufficient to establish stability. This result, which may be of independent interest, allows us to study the effect of regularization on generalization in the Bayesian RL setting.
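As a minimal sketch of the setup (in illustrative notation that is not taken from the paper), the display below shows an empirical training objective averaged over $N$ MDP instances sampled from the prior, together with the kind of quadratic growth condition that can play the role of strong convexity in a stability argument; the regularizer $\Omega$, its weight $\lambda$, the growth constant $\mu$, and the policy metric $d$ are assumed symbols introduced only for this sketch.
% Illustrative sketch only (assumes amsmath); the symbols \Omega, \lambda, \mu, d
% are placeholders, not the paper's actual notation.
\begin{align*}
  % Empirical meta-RL objective: average regularized return over N MDP
  % instances M_1, ..., M_N drawn i.i.d. from the prior.
  \hat{\pi} \in \operatorname*{arg\,max}_{\pi} \;
      \frac{1}{N}\sum_{i=1}^{N}
      \Big( J_{M_i}(\pi) - \lambda\,\Omega(\pi) \Big), \\
  % Quadratic growth: for each MDP M, the regularized objective drops at least
  % quadratically (in some policy metric d) away from its maximizer \pi^*_M;
  % a condition of this type can substitute for strong convexity when arguing
  % algorithmic stability.
  \big( J_{M}(\pi^*_{M}) - \lambda\,\Omega(\pi^*_{M}) \big)
      - \big( J_{M}(\pi) - \lambda\,\Omega(\pi) \big)
      \;\ge\; \tfrac{\mu}{2}\, d\big(\pi, \pi^*_{M}\big)^{2}
      \quad \text{for all } \pi .
\end{align*}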