We introduce PPOPT - Proximal Policy Optimization using Pretraining, a novel model-free deep reinforcement learning algorithm that leverages pretraining to achieve high training efficiency and stability from very small training samples in physics-based environments. Reinforcement learning agents typically rely on large numbers of environment interactions to learn a policy. However, frequent interaction with a (computer-simulated) environment can incur high computational costs, especially when the environment is complex. Our main innovation is a new policy network architecture consisting of a pretrained neural network middle section sandwiched between two fully-connected networks. Pretraining this middle section on a different environment with similar physics helps the agent learn the target environment efficiently, because it can leverage a general understanding of the transferable physics characteristics of the pretraining environment. We demonstrate that PPOPT outperforms the classic PPO baseline on small training samples, both in rewards gained and in overall training stability. While PPOPT underperforms classic model-based methods such as DYNA DDPG, its model-free nature allows it to train in significantly less time than its model-based counterparts. Finally, we present our implementation of PPOPT as open-source software, available at github.com/Davidrxyang/PPOPT.
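The sandwich architecture described above might be sketched roughly as follows; this is a minimal illustration assuming a PyTorch-style implementation, where the middle network's weights come from pretraining on a physics-similar source environment and the two fully-connected adapter heads are trained on the target environment. The layer sizes, the `pretrained_core` module, and the checkpoint path are hypothetical placeholders, not the exact configuration used in PPOPT.

```python
import torch
import torch.nn as nn

class SandwichPolicy(nn.Module):
    """Policy network: target-specific input head -> pretrained middle core -> target-specific output head."""

    def __init__(self, obs_dim, act_dim, pretrained_core, core_in=64, core_out=64):
        super().__init__()
        # Fully-connected input adapter, trained on the target environment.
        self.input_head = nn.Sequential(nn.Linear(obs_dim, core_in), nn.Tanh())
        # Middle section pretrained on a source environment with similar physics.
        self.core = pretrained_core
        # Fully-connected output adapter producing action logits (or means).
        self.output_head = nn.Linear(core_out, act_dim)

    def forward(self, obs):
        return self.output_head(self.core(self.input_head(obs)))

# Hypothetical usage: build the core, load weights saved from the pretraining run,
# then fine-tune the heads (and optionally the core) with PPO on the small target sample.
core = nn.Sequential(nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh())
# core.load_state_dict(torch.load("pretrained_core.pt"))  # hypothetical checkpoint path
policy = SandwichPolicy(obs_dim=11, act_dim=3, pretrained_core=core)
```

The adapter heads absorb the difference in observation and action dimensions between the two environments, so the pretrained core can be reused unchanged.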