General policy improvement (GPI) and trust-region learning (TRL) are the predominant frameworks within contemporary reinforcement learning (RL), serving as the core models for solving Markov decision processes (MDPs). Unfortunately, in their mathematical form they are sensitive to modifications, and thus the practical instantiations that implement them do not automatically inherit their improvement guarantees. As a result, the spectrum of available rigorous MDP-solvers is narrow. Indeed, many state-of-the-art (SOTA) algorithms, such as TRPO and PPO, are not proven to converge. In this paper, we propose \textsl{mirror learning} -- a general solution to the RL problem. We reveal GPI and TRL to be but small points within this far greater space of algorithms, every member of which boasts the monotonic improvement property and converges to the optimal policy. We show that virtually all SOTA algorithms for RL are instances of mirror learning, and thus suggest that their empirical performance is a consequence of their theoretical properties rather than of approximate analogies. Excitingly, we show that mirror learning opens up a whole new space of policy learning methods with convergence guarantees.
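To make the unification claim concrete, the display below is a hedged, simplified sketch of the kind of update rule that mirror learning is meant to capture; the drift functional $\mathfrak{D}$, the neighbourhood operator $\mathcal{N}$, and the sampling distribution $\beta_{\pi}$ are the framework's objects introduced later in the paper, so this form is illustrative rather than a quotation of the formal definition (in particular, the full framework also permits a state-dependent re-weighting of the drift term):
\[
\pi_{\text{new}} \;\in\; \operatorname*{arg\,max}_{\bar{\pi} \in \mathcal{N}(\pi_{\text{old}})}\;
\mathbb{E}_{s \sim \beta_{\pi_{\text{old}}}}\!\Big[\,
\mathbb{E}_{a \sim \bar{\pi}}\big[A_{\pi_{\text{old}}}(s,a)\big]
\;-\; \mathfrak{D}_{\pi_{\text{old}}}\big(\bar{\pi} \,\big|\, s\big)
\Big].
\]
Intuitively, setting the drift $\mathfrak{D}$ to zero and the neighbourhood $\mathcal{N}$ to the set of all policies yields a GPI-style greedy step, whereas a divergence-based drift together with a trust-region neighbourhood yields a TRL-style constrained step; choosing these components appropriately is what, per the claim above, places algorithms such as TRPO and PPO inside the mirror learning space.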