During the training of a reinforcement learning (RL) agent, the distribution of the training data is non-stationary because the agent's behavior changes over time. There is therefore a risk that the agent overspecializes to a particular distribution and its overall performance suffers. Ensemble RL can mitigate this issue by learning a robust policy, but it incurs heavy computational costs due to the additional value and policy functions it introduces. In this paper, to avoid this notorious resource-consumption issue, we design a novel and simple ensemble deep RL framework that integrates multiple models into a single model. Specifically, we propose the \underline{M}inimalist \underline{E}nsemble \underline{P}olicy \underline{G}radient framework (MEPG), which introduces a minimalist ensemble-consistent Bellman update based on a modified dropout operator. MEPG retains the ensemble property by keeping the dropout mask consistent on both sides of the Bellman equation; the dropout operator additionally improves MEPG's generalization capability. Moreover, we show theoretically that the policy evaluation phase of MEPG maintains two synchronized deep Gaussian processes. To verify the MEPG framework's ability to generalize, we perform experiments on the Gym simulator, which show that MEPG outperforms, or performs on par with, current state-of-the-art ensemble and model-free methods without additional computational cost.
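A minimal sketch of the dropout-consistent Bellman update idea described above, under stated assumptions: a toy linear Q-function over state-action features and standard inverted dropout. All names (`q_value`, `dropout_mask`, the feature vectors) are hypothetical illustrations, not the paper's implementation; the point is only that one shared mask is sampled per update and applied to both the prediction and the bootstrap target, so each mask plays the role of one ensemble member.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(shape, p, rng):
    # Inverted dropout: Bernoulli keep-mask scaled by 1/(1-p) so the
    # expected activation is unchanged.
    return rng.binomial(1, 1.0 - p, size=shape) / (1.0 - p)

# Hypothetical tiny Q-network: a single linear layer over features.
W = rng.normal(size=(8,))

def q_value(features, mask):
    # The SAME mask is applied on both sides of the Bellman equation,
    # which is what makes the update "ensemble-consistent".
    return float((features * mask) @ W)

phi_sa  = rng.normal(size=(8,))   # features of (s, a)
phi_s2a = rng.normal(size=(8,))   # features of (s', a')
reward, gamma, p = 1.0, 0.99, 0.5

mask = dropout_mask((8,), p, rng)                 # sampled once per update
target = reward + gamma * q_value(phi_s2a, mask)  # mask on the target side
td_error = target - q_value(phi_sa, mask)         # same mask on the prediction
```

Sampling a fresh mask at each update step sweeps over the implicit ensemble members without ever instantiating separate value networks, which is the source of the framework's resource savings.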