On-policy deep reinforcement learning algorithms suffer from low data utilization and require a large amount of experience for policy improvement. This paper proposes a proximal policy optimization algorithm with prioritized trajectory replay (PTR-PPO) that combines on-policy and off-policy methods to improve sampling efficiency by prioritizing the replay of trajectories generated by old policies. We first design three trajectory priorities based on the characteristics of trajectories: the first two are the max and mean trajectory priorities, computed from one-step empirical generalized advantage estimation (GAE) values, and the third is a reward trajectory priority based on the normalized undiscounted cumulative reward. Then, we incorporate prioritized trajectory replay into the PPO algorithm, propose a truncated importance weight method to overcome the high variance caused by large importance weights under multistep experience, and design a policy improvement loss function for PPO under off-policy conditions. We evaluate the performance of PTR-PPO on a set of Atari discrete control tasks, achieving state-of-the-art performance. In addition, by analyzing the heatmap of priority changes at various locations in the priority memory during training, we find that memory size and rollout length can have a significant impact on the distribution of trajectory priorities and, hence, on the performance of the algorithm.
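To make the two replay-related ideas in the abstract concrete, the sketch below shows one plausible way to compute the three trajectory priorities and the truncated per-step importance weights. It is a minimal illustration, not the paper's implementation: using absolute GAE values as priorities, min-max normalization of the undiscounted return, and the names `reward_min`, `reward_max`, and `clip_c` are all assumptions introduced here for clarity.

```python
import numpy as np


def trajectory_priorities(advantages, rewards, reward_min, reward_max, eps=1e-6):
    """Compute the three trajectory priorities sketched in the abstract.

    advantages : per-step one-step GAE estimates for one trajectory.
    rewards    : per-step undiscounted rewards of the same trajectory.
    reward_min, reward_max : assumed normalization bounds for the
        undiscounted return (the abstract only says it is normalized).
    """
    adv = np.abs(np.asarray(advantages, dtype=np.float64))  # assumed: magnitude of GAE
    max_priority = adv.max()      # max trajectory priority
    mean_priority = adv.mean()    # mean trajectory priority
    undiscounted_return = float(np.sum(rewards))
    reward_priority = (undiscounted_return - reward_min) / (reward_max - reward_min + eps)
    return max_priority, mean_priority, reward_priority


def truncated_importance_weights(log_pi_new, log_pi_old, clip_c=1.0):
    """Truncate per-step importance ratios at clip_c so that their product
    over a multistep trajectory does not blow up the variance.
    clip_c is a hypothetical hyperparameter name, not taken from the paper."""
    ratios = np.exp(np.asarray(log_pi_new) - np.asarray(log_pi_old))
    return np.minimum(ratios, clip_c)
```

For example, a stored trajectory's priority under the "max" scheme would be `trajectory_priorities(gae, r, rmin, rmax)[0]`, and the truncated weights would multiply the off-policy PPO loss terms in place of the raw importance ratios.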