Prioritized Level Replay

Simulated environments with procedurally generated content have become popular benchmarks for testing systematic generalization of reinforcement learning agents. Every level in such an environment is algorithmically created, thereby exhibiting a unique configuration of underlying factors of variation, such as layout, positions of entities, asset appearances, or even the rules governing environment transitions. Fixed sets of training levels can be determined to aid comparison and reproducibility, and test levels can be held out to evaluate the generalization and robustness of agents. While prior work samples training levels in a direct way (e.g. uniformly) for the agent to learn from, we investigate the hypothesis that different levels provide different learning progress for an agent at specific times during training. We introduce Prioritized Level Replay, a general framework for estimating the future learning potential of a level given the current state of the agent's policy. We find that temporal-difference (TD) errors, while previously used to selectively sample past transitions, also prove effective for scoring a level's future learning potential when the agent replays (that is, revisits) that level to generate entirely new episodes of experiences from it. We report significantly improved sample-efficiency and generalization on the majority of Procgen Benchmark environments as well as two challenging MiniGrid environments. Lastly, we present a qualitative analysis showing that Prioritized Level Replay induces an implicit curriculum, taking the agent gradually from easier to harder levels.
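
The scoring-and-replay idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract says only that TD errors are used to score a level, so the mean absolute one-step TD error, the rank-based weighting, the `temperature` parameter, and all function names here are hypothetical choices made for the example.

```python
import random


def td_errors(rewards, values, gamma=0.99):
    """One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` must have one more entry than `rewards`; the last entry is the
    value of the final state (0.0 if the episode terminated).
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def level_score(rewards, values, gamma=0.99):
    """Assumed scoring rule: mean absolute TD error over the latest episode."""
    deltas = td_errors(rewards, values, gamma)
    return sum(abs(d) for d in deltas) / len(deltas)


def replay_probs(scores, temperature=1.0):
    """Assumed prioritization: weight each level by (1 / rank) ** (1 / temperature),
    where rank 1 is the level with the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    weights = [0.0] * len(scores)
    for rank, i in enumerate(order, start=1):
        weights[i] = (1.0 / rank) ** (1.0 / temperature)
    total = sum(weights)
    return [w / total for w in weights]


def sample_level(scores, rng, temperature=1.0):
    """Draw the index of the next training level to replay."""
    probs = replay_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical scores for three training levels after their latest episodes.
    scores = [0.1, 0.5, 0.2]
    print(replay_probs(scores))
    print(sample_level(scores, rng))
```

In a training loop, the agent would play a fresh episode on the sampled level, recompute that level's score from the new trajectory, and sample again, so the priorities track the current policy rather than stored transitions. A lower `temperature` concentrates sampling on the highest-scoring levels; a higher one moves it toward uniform.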
