Empirical Design in Reinforcement Learning

Empirical design in reinforcement learning is no small task. Running good experiments requires attention to detail and, at times, significant computational resources. While the compute available per dollar has continued to grow rapidly, so has the scale of typical experiments in reinforcement learning. It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience. The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms. Recent studies have highlighted that popular algorithms are sensitive to hyper-parameter settings and implementation details, and that common empirical practice leads to weak statistical evidence (Machado et al., 2018; Henderson et al., 2018). Here we take this one step further. This manuscript represents both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyper-parameters and experimenter bias. Throughout, we highlight common mistakes found in the literature and show their statistical consequences in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as how to stay alert to potential pitfalls in our empirical design.
