Tremendous progress has been made in reinforcement learning (RL) over the past decade. Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations, intuitions, and experimentation. Such an approach of creating algorithms manually is limited by human understanding and ingenuity. In contrast, meta-learning provides a toolkit for automatic optimisation of machine learning methods, potentially addressing this flaw. However, black-box approaches that attempt to discover RL algorithms with minimal prior structure have thus far not outperformed existing hand-crafted algorithms. Mirror Learning, a framework that includes RL algorithms such as PPO, offers a potential middle-ground starting point: while every method in this framework comes with theoretical guarantees, the components that differentiate them remain open to design. In this paper we explore the Mirror Learning space by meta-learning a "drift" function. We refer to the immediate result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation, which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.
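To illustrate the setting described above: in Mirror Learning, the policy is updated by maximising the importance-weighted advantage minus a non-negative "drift" penalty that vanishes when the new policy equals the old one. PPO's clipping term can be written as one such drift, and LPO replaces it with a meta-learnt one. The JAX sketch below shows this structure; the `drift_net` architecture and its inputs are illustrative assumptions, not the paper's exact parameterisation.

```python
import jax
import jax.numpy as jnp

def ppo_clip_drift(ratio, adv, eps=0.2):
    # PPO's clipping objective written as a drift: the surrogate
    # ratio * adv minus this non-negative penalty recovers
    # min(ratio * adv, clip(ratio, 1-eps, 1+eps) * adv).
    return jax.nn.relu((ratio - jnp.clip(ratio, 1.0 - eps, 1.0 + eps)) * adv)

def drift_net(params, ratio, adv):
    # Hypothetical small MLP standing in for a learnt drift; the features
    # and architecture are illustrative, not LPO's actual design.
    def penalty(r_minus_1, log_r):
        x = jnp.stack([r_minus_1, log_r, adv], axis=-1)
        h = jnp.tanh(x @ params["w1"] + params["b1"])
        return (h @ params["w2"] + params["b2"]).squeeze(-1)

    raw = penalty(ratio - 1.0, jnp.log(ratio))
    # Subtract the value at ratio == 1 and clip at zero so the drift is
    # non-negative and vanishes when the new policy equals the old one,
    # as Mirror Learning requires.
    at_old_policy = penalty(jnp.zeros_like(adv), jnp.zeros_like(adv))
    return jax.nn.relu(raw - at_old_policy)

def surrogate_objective(params, ratio, adv):
    # Mirror-Learning-style surrogate: importance-weighted advantage
    # minus the drift penalty, maximised with respect to the policy.
    return jnp.mean(ratio * adv - drift_net(params, ratio, adv))
```

In this sketch, swapping `drift_net` for `ppo_clip_drift` recovers a PPO-style update, while treating `params` as meta-parameters to be optimised over training runs corresponds to the kind of search LPO performs.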