Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. Unfortunately, this evaluation approach provides limited evidence for post-deployment generalization beyond the test distribution. In this paper, we address this limitation by extending the recent CheckList testing methodology from natural language processing to planning-based RL. Specifically, we consider testing RL agents that make decisions via online tree search using a learned transition model and value function. The key idea is to improve the assessment of future performance via a CheckList approach for exploring and assessing the agent's inferences during tree search. The approach provides the user with an interface and a general query-rule mechanism for identifying potential inference flaws and validating expected inference invariances. We present a user study in which knowledgeable AI researchers use the approach to evaluate an agent trained to play a complex real-time strategy game. The results show that the approach is effective in allowing users to identify previously unknown flaws in the agent's reasoning. In addition, our analysis provides insight into how AI experts use this type of testing approach, which may help improve future instantiations.
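The abstract describes the query-rule mechanism only at a high level; the paper's contribution is an interface and user study rather than a reference implementation. To make the idea concrete, the following is a minimal Python sketch of what a CheckList-style query-rule over tree-search inferences might look like. `SearchNode`, `QueryRule`, `run_checklist`, and the consistency tolerance are all illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical node structure for the agent's search tree; field names are
# illustrative, not taken from the paper.
@dataclass
class SearchNode:
    state_features: dict                  # user-visible description of the state
    value_estimate: float                 # learned value function output at this node
    children: List["SearchNode"] = field(default_factory=list)

# A query-rule pairs a query (which inferences to inspect) with a rule
# (an invariance those inferences are expected to satisfy).
@dataclass
class QueryRule:
    name: str
    query: Callable[[SearchNode], bool]   # selects nodes of interest
    rule: Callable[[SearchNode], bool]    # returns False on a violation

def run_checklist(root: SearchNode,
                  checks: List[QueryRule]) -> List[Tuple[str, SearchNode]]:
    """Walk the search tree and report (rule name, node) for every violation."""
    violations = []
    stack = [root]
    while stack:
        node = stack.pop()
        for check in checks:
            if check.query(node) and not check.rule(node):
                violations.append((check.name, node))
        stack.extend(node.children)
    return violations

# Example invariance: an expanded node's value estimate should roughly agree
# with the best value among its children (a Bellman-style consistency check).
# The 0.1 tolerance is an arbitrary illustrative threshold.
bellman_consistency = QueryRule(
    name="parent value tracks best child",
    query=lambda n: len(n.children) > 0,
    rule=lambda n: abs(n.value_estimate
                       - max(c.value_estimate for c in n.children)) < 0.1,
)
```

A user would author a collection of such query-rules, run them against the trees the agent produces on test scenarios, and inspect the flagged nodes to decide whether each violation reflects a genuine flaw in the learned model or value function.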