Consistently checking the statistical significance of experimental results is the first mandatory step towards reproducible science. This paper presents a hitchhiker's guide to rigorous comparisons of reinforcement learning algorithms. After introducing the concepts of statistical testing, we review the relevant statistical tests and compare them empirically in terms of false positive rate and statistical power as a function of the sample size (number of seeds) and effect size. We further investigate the robustness of these tests to violations of the most common assumptions (normal distributions, identical distributions, equal variances). Besides simulations, we compare empirical distributions obtained by running Soft Actor-Critic and Twin-Delayed Deep Deterministic Policy Gradient on Half-Cheetah. We conclude by providing guidelines and code to perform rigorous comparisons of RL algorithm performance.
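To illustrate the kind of comparison the abstract describes, the sketch below applies Welch's t-test, one of the tests commonly reviewed in this setting, to the per-seed final performance of two algorithms. This is a minimal illustrative sketch, not the paper's code: the return values are hypothetical placeholders, and Welch's test is chosen because it does not assume equal variances.

```python
# Minimal sketch (hypothetical data, not the paper's code): testing whether
# two RL algorithms differ significantly in mean final performance across seeds.
from scipy import stats

# Final episodic returns, one value per random seed (illustrative placeholders).
algo_a = [5012.3, 4820.1, 5333.7, 4990.5, 5120.8, 4875.2, 5201.4, 5044.9]
algo_b = [4510.6, 4722.4, 4401.2, 4655.9, 4580.3, 4499.7, 4610.1, 4705.5]

# Welch's t-test: H0 is "both algorithms have the same mean performance".
# equal_var=False selects Welch's variant, which tolerates unequal variances.
t_stat, p_value = stats.ttest_ind(algo_a, algo_b, equal_var=False)

# Reject H0 at significance level alpha = 0.05 when p_value < alpha.
significant = p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {significant}")
```

With only a handful of seeds, such a test can easily miss real differences (low statistical power) or, if its assumptions are violated, report spurious ones — which is precisely the trade-off the paper studies as a function of sample size and effect size.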