Generalization in Reinforcement Learning (RL) aims to learn an agent from the training environments that generalizes well to a target environment. This paper studies RL generalization from a theoretical perspective: how much can we expect pre-training over the training environments to help in the target environment? When interaction with the target environment is not allowed, we certify that the best we can obtain is a policy that is near-optimal in an average sense, and we design an algorithm that achieves this goal. Furthermore, when the agent is allowed to interact with the target environment, we give a surprising result showing that, asymptotically, the improvement from pre-training is at most a constant factor. On the other hand, in the non-asymptotic regime, we design an efficient algorithm and prove a distribution-based regret bound in the target environment that is independent of the size of the state-action space.
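As a rough formalization of the "near-optimal in an average sense" criterion above, one may view pre-training as optimizing the expected value over the distribution of environments. The following is a minimal sketch; the symbols $\mathcal{D}$, $M$, $V_M^{\pi}$, and $\varepsilon$ are illustrative assumptions, not notation taken from the paper.
% A minimal sketch, assuming environments are MDPs M drawn from a distribution
% \mathcal{D}, and V_M^\pi denotes the value of policy \pi in environment M.
% These symbols are assumptions for illustration only.
\[
  \pi^{\star}_{\mathrm{avg}} \in \arg\max_{\pi}\; \mathbb{E}_{M \sim \mathcal{D}}\big[ V_M^{\pi} \big],
\]
% "Near-optimal in an average sense" would then mean the learned policy \hat{\pi} satisfies
\[
  \mathbb{E}_{M \sim \mathcal{D}}\big[ V_M^{\pi^{\star}_{\mathrm{avg}}} \big]
  \;-\; \mathbb{E}_{M \sim \mathcal{D}}\big[ V_M^{\hat{\pi}} \big] \;\le\; \varepsilon .
\]
Under this reading, the claim for the no-interaction setting is that no algorithm can guarantee more than average-case optimality over $\mathcal{D}$, since the target environment cannot be identified without further interaction.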