In order for agents trained by deep reinforcement learning to work alongside humans in realistic settings, we will need to ensure that the agents are \emph{robust}. Since the real world is very diverse, and human behavior often changes in response to agent deployment, the agent will likely encounter novel situations that have never been seen during training. This results in an evaluation challenge: if we cannot rely on the average training or validation reward as a metric, then how can we effectively evaluate robustness? We take inspiration from the practice of \emph{unit testing} in software engineering. Specifically, we suggest that when designing AI agents that collaborate with humans, designers should search for potential edge cases in \emph{possible partner behavior} and \emph{possible states encountered}, and write tests which check that the behavior of the agent in these edge cases is reasonable. We apply this methodology to build a suite of unit tests for the Overcooked-AI environment, and use this test suite to evaluate three proposals for improving robustness. We find that the test suite provides significant insight into the effects of these proposals, effects that are generally not revealed by looking solely at the average validation reward.