Mirror Learning: A Unifying Framework of Policy Optimisation

Modern deep reinforcement learning (RL) algorithms are motivated by either the general policy improvement (GPI) or trust-region learning (TRL) frameworks. However, algorithms that strictly respect these theoretical frameworks have proven unscalable. Surprisingly, the only known scalable algorithms violate the GPI/TRL assumptions, e.g. due to required regularisation or other heuristics. The current explanation of their empirical success is essentially "by analogy": they are deemed approximate adaptations of theoretically sound methods. Unfortunately, studies have shown that in practice these algorithms differ greatly from their conceptual ancestors. In contrast, in this paper we introduce a novel theoretical framework, named Mirror Learning, which provides theoretical guarantees to a large class of algorithms, including TRPO and PPO. While the latter two exploit the flexibility of our framework, GPI and TRL fit in merely as pathologically restrictive corner cases thereof. This suggests that the empirical performance of state-of-the-art methods is a direct consequence of their theoretical properties, rather than of the aforementioned approximate analogies. Mirror learning sets us free to boldly explore novel, theoretically sound RL algorithms, a thus far uncharted wonderland.
