Mirror Learning: A Unifying Framework of Policy Optimisation

Most modern deep reinforcement learning (RL) algorithms are motivated by either the general policy improvement (GPI) or trust-region learning (TRL) frameworks. However, algorithms that strictly respect these theoretical frameworks have proven unscalable. Surprisingly, the only known scalable algorithms violate the GPI/TRL assumptions, e.g. due to required regularisation or other heuristics. The current explanation of their empirical success is essentially by "analogy": they are deemed approximate adaptations of theoretically sound methods. Unfortunately, studies have shown that in practice these algorithms differ greatly from their conceptual ancestors. In contrast, in this paper, we introduce a novel theoretical framework, named Mirror Learning, which provides theoretical guarantees to a large class of algorithms, including TRPO and PPO. While the latter two exploit the flexibility of our framework, GPI and TRL fit in merely as pathologically restrictive or impractical corner cases thereof. This suggests that the empirical performance of state-of-the-art methods is a direct consequence of their theoretical properties, rather than of the aforementioned approximate analogies. Mirror Learning sets us free to boldly explore novel, theoretically sound RL algorithms, a thus far uncharted wonderland.
