Deep ensembles have been shown to extend the benefits of classical ensemble learning to neural networks and to reinforcement learning (RL). However, much remains to be done to improve the efficiency of such ensemble models. In this work, we present Diverse Ensembles for Fast Transfer in RL (DEFT), a new ensemble-based method for reinforcement learning in highly multimodal environments and for improved transfer to unseen environments. The algorithm has two main phases: training the ensemble members, and synthesizing (or fine-tuning) the ensemble members into a policy that works in a new environment. In the first phase, standard policy gradient or actor-critic agents are trained in parallel, with an added loss term that encourages their policies to differ from one another. This pushes the individual unimodal agents to explore the space of optimal policies and to capture more of the environment's multimodality than a single actor could. In the second phase, DEFT synthesizes the component policies, in one of two ways, into a new policy that works well in a modified environment. To evaluate DEFT, we start with a base version of the Proximal Policy Optimization (PPO) algorithm and extend it with the DEFT modifications. Our results show that the pretraining phase is effective at producing diverse policies in multimodal environments. DEFT often converges to a high reward significantly faster than alternatives such as random initialization without DEFT and direct fine-tuning of individual ensemble members. While more work remains to analyze DEFT theoretically and to make it more robust, we believe it provides a strong framework for capturing environment multimodality while still using RL methods with simple policy representations.
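As a concrete illustration of the first phase, the sketch below shows one way a diversity term might be folded into a per-member PPO loss. The abstract does not specify the form of DEFT's diversity term, so this is only a hedged sketch under one common assumption: a KL-divergence "repulsion" between ensemble members' action distributions on a shared batch of states, weighted by a hypothetical coefficient diversity_coef. All names here (diversity_bonus, deft_style_loss, diversity_coef) are illustrative, not taken from the paper.

    # Minimal sketch of a diversity-regularized PPO loss for one ensemble
    # member. Assumption (not from the paper): diversity is measured as the
    # mean KL divergence from member i's action distribution to the other
    # members' distributions, evaluated on a shared batch of states.
    import torch
    import torch.distributions as D

    def diversity_bonus(dists: list, i: int) -> torch.Tensor:
        # Mean KL from member i to every other member (larger = more diverse).
        others = [d for j, d in enumerate(dists) if j != i]
        kls = [D.kl_divergence(dists[i], d).mean() for d in others]
        return torch.stack(kls).mean()

    def deft_style_loss(ppo_loss: torch.Tensor, dists: list, i: int,
                        diversity_coef: float = 0.1) -> torch.Tensor:
        # Standard PPO loss minus a bonus for diverging from the other
        # members, so minimizing the total loss pushes the policies apart.
        return ppo_loss - diversity_coef * diversity_bonus(dists, i)

    # Illustrative usage: 3 members, a batch of 32 states, 4 discrete actions.
    dists = [D.Categorical(logits=torch.randn(32, 4)) for _ in range(3)]
    loss_0 = deft_style_loss(torch.tensor(0.5), dists, i=0)  # placeholder PPO loss

In practice the bonus would be computed on states sampled during each member's rollouts, and diversity_coef trades off per-member reward against ensemble diversity.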