The objective of many real-world tasks is complex and difficult to specify procedurally. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, these benchmarks are slow, unreliable, and cannot isolate failures. As a complementary approach, we develop a suite of simple diagnostic tasks that test individual facets of algorithm performance in isolation. We evaluate a range of common reward and imitation learning algorithms on our tasks. Our results confirm that algorithm performance is highly sensitive to implementation details. Moreover, in a case study of a popular preference-based reward learning implementation, we illustrate how the suite can pinpoint design flaws and rapidly evaluate candidate solutions. The environments are available at https://github.com/HumanCompatibleAI/seals .