Ensemble reinforcement learning (RL), which introduces multiple value and policy functions, aims to mitigate the instability of Q-learning and to learn a robust policy. In this paper, we seek a novel yet simple ensemble deep RL algorithm that addresses the resource consumption issue. Specifically, we consider integrating multiple models into a single model. To this end, we propose the \underline{M}inimalist \underline{E}nsemble \underline{P}olicy \underline{G}radient framework (MEPG), which introduces a minimalist ensemble-consistent Bellman update. We find that a single value network is sufficient in our framework. Moreover, we show theoretically that the policy evaluation phase in MEPG is mathematically equivalent to a deep Gaussian process. To verify the effectiveness of the MEPG framework, we conduct experiments on the Gym simulator, which show that MEPG matches or outperforms state-of-the-art ensemble and model-free methods without additional computational cost.
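For reference, a standard ensemble temporal-difference target with $K$ target critics is sketched below (minimum aggregation as in clipped double Q-learning is one common choice); this is an illustrative baseline, not necessarily MEPG's exact operator, and the symbols $K$, $\theta'_i$, and the aggregation rule are assumptions for exposition. MEPG's minimalist ensemble-consistent update is designed to retain the benefit of such a target while maintaining only a single value network.
\begin{equation*}
  y \;=\; r \;+\; \gamma \min_{i=1,\dots,K} Q_{\theta'_i}\!\bigl(s',\, \pi_{\phi}(s')\bigr),
  \qquad
  \mathcal{L}(\theta_i) \;=\; \mathbb{E}\Bigl[\bigl(Q_{\theta_i}(s,a) - y\bigr)^{2}\Bigr].
\end{equation*}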