The experience replay mechanism allows agents to use their experiences multiple times. In prior work, the sampling probability of each transition was adjusted according to its importance. Reassigning sampling probabilities to every transition in the replay buffer after each iteration is highly inefficient. Therefore, experience replay prioritization algorithms recalculate the significance of a transition only when it is sampled, to gain computational efficiency. However, the importance of a transition changes dynamically as the agent's policy and value function are updated. In addition, the replay buffer stores transitions generated by the agent's previous policies, which may deviate significantly from its most recent policy. A larger deviation from the most recent policy leads to more off-policy updates, which are detrimental to the agent. In this paper, we develop a novel algorithm, Batch Prioritizing Experience Replay via KL Divergence (KLPER), which prioritizes batches of transitions rather than directly prioritizing each transition. Moreover, to reduce the off-policyness of the updates, our algorithm selects one batch among a certain number of candidate batches and forces the agent to learn from the batch that is most likely to have been generated by its most recent policy. We combine our algorithm with Deep Deterministic Policy Gradient and Twin Delayed Deep Deterministic Policy Gradient and evaluate it on various continuous control tasks. KLPER provides promising improvements for deep deterministic continuous control algorithms in terms of sample efficiency, final performance, and stability of the policy during training.
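To make the batch-selection idea concrete, the following is a minimal sketch of how one batch among several candidates might be chosen, assuming the divergence between the batch's stored actions and the current deterministic policy's actions is approximated by a mean squared distance (the KL divergence between two fixed-variance Gaussians reduces to this up to a constant). The interface names (`replay_buffer.sample`, `policy`) and the noise scale `sigma` are placeholders, not the paper's actual implementation.

```python
import numpy as np

def select_batch(replay_buffer, policy, num_candidates=10, batch_size=256, sigma=0.2):
    """Sample several candidate batches and keep the one whose stored actions
    are closest, in KL terms, to the actions of the most recent policy.

    For Gaussians N(mu1, sigma^2 I) and N(mu2, sigma^2 I), the KL divergence is
    ||mu1 - mu2||^2 / (2 sigma^2), so ranking candidate batches by the mean
    squared distance between stored and current actions is equivalent.
    """
    best_batch, best_score = None, np.inf
    for _ in range(num_candidates):
        # Hypothetical buffer interface returning numpy arrays.
        states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
        current_actions = policy(states)  # deterministic actions of the latest policy
        kl_proxy = np.mean(np.sum((actions - current_actions) ** 2, axis=-1)) / (2 * sigma ** 2)
        if kl_proxy < best_score:
            best_score = kl_proxy
            best_batch = (states, actions, rewards, next_states, dones)
    return best_batch  # fed to the standard DDPG / TD3 update
```

Under this sketch, the selected batch simply replaces the uniformly sampled batch in the usual actor-critic update, so the surrounding DDPG or TD3 training loop is otherwise unchanged.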