Evaluations of Deep Reinforcement Learning (DRL) methods are an integral part of the field's scientific progress. Beyond designing DRL methods for general intelligence, designing task-specific methods is becoming increasingly prominent in real-world applications. In these settings, the standard evaluation practice involves using a few instances of Markov Decision Processes (MDPs) to represent the task. However, many tasks induce a large family of MDPs owing to variations in the underlying environment, particularly in real-world contexts. For example, in traffic signal control, variations may stem from intersection geometries and traffic flow levels. The selected MDP instances may thus inadvertently cause overfitting, lacking the statistical power to draw conclusions about a method's true performance across the family. In this article, we augment DRL evaluations to consider parameterized families of MDPs. We show that, compared to evaluating DRL methods on select MDP instances, evaluating on the MDP family often yields a substantially different relative ranking of methods, casting doubt on which methods should be considered state-of-the-art. We validate this phenomenon on standard control benchmarks and the real-world application of traffic signal control. At the same time, we show that accurately evaluating on an MDP family is nontrivial. Overall, this work identifies new challenges for empirical rigor in reinforcement learning, especially as the outcomes of DRL trickle into downstream decision-making.
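To make the evaluation protocol concrete, the following is a minimal sketch, in Python with Gymnasium, of scoring a policy over a parameterized MDP family rather than a single instance. Here CartPole-v1 with varied pole length stands in for the family; `policy`, `evaluate_on_family`, and the parameter grid are illustrative assumptions, not the article's exact setup or benchmarks.

```python
# Sketch: evaluate a policy over a parameterized family of MDPs,
# using CartPole-v1 with pole length as the family parameter.
import numpy as np
import gymnasium as gym

def evaluate_on_family(policy, pole_lengths, episodes_per_mdp=10, seed=0):
    """Return mean/std of episode returns across the MDP family."""
    rng = np.random.default_rng(seed)
    returns = []
    for length in pole_lengths:
        env = gym.make("CartPole-v1")
        env.unwrapped.length = length  # vary the underlying dynamics
        for _ in range(episodes_per_mdp):
            obs, _ = env.reset(seed=int(rng.integers(1 << 31)))
            done, total = False, 0.0
            while not done:
                obs, reward, terminated, truncated, _ = env.step(policy(obs))
                total += reward
                done = terminated or truncated
            returns.append(total)
        env.close()
    return float(np.mean(returns)), float(np.std(returns))

# Scoring on the family (a grid of pole lengths) rather than the single
# default instance is what can change the relative ranking of methods.
mean_ret, std_ret = evaluate_on_family(
    policy=lambda obs: 0,                   # placeholder policy for the sketch
    pole_lengths=np.linspace(0.3, 0.9, 7),  # hypothetical parameter grid
)
print(f"family mean return: {mean_ret:.1f} +/- {std_ret:.1f}")
```

A method ranking derived from this family-level score can differ from one computed on the default instance alone, which is the phenomenon the abstract describes; in practice the parameter grid, the number of episodes per MDP, and the aggregation statistic all affect how faithfully the family is represented.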