Visualizing optimization landscapes has led to many fundamental insights in numerical optimization and to novel improvements in optimization techniques. However, visualizations of the objective that reinforcement learning optimizes (the "reward surface") have only ever been generated for a small number of narrow contexts. This work presents reward surfaces and related visualizations of 27 of the most widely used reinforcement learning environments in Gym for the first time. We also explore reward surfaces in the policy gradient direction and show for the first time that many popular reinforcement learning environments have frequent "cliffs" (sudden large drops in expected return). We demonstrate that A2C often "dives off" these cliffs into low-reward regions of the parameter space while PPO avoids them, confirming a popular intuition for PPO's improved performance over previous methods. We additionally introduce a highly extensible library that allows researchers to easily generate these visualizations in the future. Our findings provide new intuition to explain the successes and failures of modern RL methods, and our visualizations concretely characterize several failure modes of reinforcement learning agents in novel ways.
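To make the reward-surface idea concrete, the following is a minimal illustrative sketch, not the authors' library or its API: it perturbs a policy's parameters along two random directions and estimates the mean episodic return at each grid point. It assumes the classic Gym step API (4-tuple returns, `env.reset()` returning only the observation) and uses a hypothetical linear policy on CartPole-v1 purely as a stand-in for a trained agent.

```python
import numpy as np
import gym


def episodic_return(env, theta, episodes=5):
    """Estimate the mean episodic return of a simple linear policy with parameters theta."""
    total = 0.0
    for _ in range(episodes):
        obs, done, ep_ret = env.reset(), False, 0.0
        while not done:
            # Hypothetical linear policy: threshold a dot product to pick a discrete action.
            action = int(np.dot(theta, obs) > 0)
            obs, reward, done, _ = env.step(action)  # classic 4-tuple Gym API assumed
            ep_ret += reward
        total += ep_ret
    return total / episodes


env = gym.make("CartPole-v1")
theta0 = np.zeros(env.observation_space.shape[0])  # placeholder for trained parameters
d1 = np.random.randn(*theta0.shape)                # random perturbation direction 1
d2 = np.random.randn(*theta0.shape)                # random perturbation direction 2

# Reward surface: expected return over a 2-D grid of parameter perturbations.
alphas = np.linspace(-1.0, 1.0, 11)
surface = np.array([[episodic_return(env, theta0 + a * d1 + b * d2)
                     for b in alphas] for a in alphas])
```

The resulting `surface` array can be plotted as a heatmap or 3-D surface; replacing one of the random directions with an estimated policy gradient direction gives the kind of "policy gradient direction" slice where the cliffs described above appear.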