Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance, such as mean and median scores across tasks, and ignore the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone and those drawn from a more thorough statistical analysis. With the aim of increasing the field's confidence in results reported from a handful of runs, we advocate reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results. We also present more robust and efficient aggregate metrics, such as interquartile mean scores, that achieve small uncertainty in results. Using these statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks, including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied by an open-source library, rliable, to prevent unreliable results from stagnating the field.
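To make the three tools named in the abstract concrete, the following is a minimal NumPy sketch of an interquartile mean, a percentile bootstrap interval that resamples runs within each task, and a performance profile. It is an illustration under stated assumptions, not the rliable API: the function names, the `(runs, tasks)` score layout, the number of bootstrap replicates, and the synthetic scores are all hypothetical choices made for this example.

```python
import numpy as np


def interquartile_mean(scores):
    """Mean of the middle 50% of all run-task scores (a 25% trimmed mean)."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values dropped from each tail
    return float(flat[cut:flat.size - cut].mean())


def stratified_bootstrap_ci(scores, aggregate, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `aggregate`.

    `scores` is assumed to have shape (n_runs, n_tasks). Runs are resampled
    with replacement independently within each task (the strata).
    """
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    n_runs, n_tasks = scores.shape
    stats = np.empty(reps)
    for i in range(reps):
        # One independent set of run indices per task column.
        idx = rng.integers(0, n_runs, size=(n_runs, n_tasks))
        resampled = np.take_along_axis(scores, idx, axis=0)
        stats[i] = aggregate(resampled)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def performance_profile(scores, thresholds):
    """Fraction of run-task scores strictly above each threshold."""
    scores = np.asarray(scores, dtype=float)
    return np.array([(scores > tau).mean() for tau in thresholds])


if __name__ == "__main__":
    # Synthetic normalized scores: 5 runs on 26 tasks (illustrative only).
    rng = np.random.default_rng(42)
    demo = rng.lognormal(mean=-0.5, sigma=1.0, size=(5, 26))
    print("mean  :", demo.mean())
    print("median:", np.median(demo.mean(axis=0)))
    print("IQM   :", interquartile_mean(demo))
    print("95% CI:", stratified_bootstrap_ci(demo, interquartile_mean))
    print("profile:", performance_profile(demo, [0.0, 0.5, 1.0, 2.0]))
```

With only a few runs per task, the interval returned for the IQM is typically wide enough that two algorithms with different point estimates can overlap, which is the kind of discrepancy the abstract describes. Resampling within each task keeps every task represented in every bootstrap replicate, so the interval reflects run-to-run variability rather than which tasks happened to be drawn.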