For humans to confidently decide where to deploy RL agents on real-world tasks, a developer must validate that the agent will perform well at test time. Some policy interpretability methods facilitate this by capturing the policy's decision making in a set of agent rollouts. However, even the most informative trajectories of training-time behavior may offer little insight into the agent's behavior out of distribution. In contrast, our method conveys how the agent performs under distribution shift by exhibiting its behavior over a wider trajectory distribution. We generate these trajectories by guiding the agent to more diverse, unseen states and recording its behavior there. In a user study, we demonstrate that our method enables users to outperform baseline methods on one of two agent-validation tasks.
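To make the trajectory-generation step concrete, the following is a minimal sketch, assuming a hypothetical setup not specified in the abstract: `env` is a Gym-style environment, `policy` is the agent under evaluation, and `explorer` is any exploration policy (e.g., one with a state-novelty bonus) that drives the agent toward rarely visited states before control returns to `policy` so its off-distribution behavior can be recorded.

```python
def generate_shifted_rollouts(env, policy, explorer, n_rollouts=10, horizon=100):
    """Collect trajectories that expose the policy's off-distribution behavior.

    `env`, `policy`, and `explorer` are hypothetical stand-ins: `explorer`
    guides the agent into diverse unseen states, after which the evaluated
    `policy` takes over and its actions are recorded for the user to inspect.
    """
    rollouts = []
    for _ in range(n_rollouts):
        state = env.reset()
        done = False
        trajectory = []
        # Phase 1: steer the agent toward a diverse, rarely visited region.
        for _ in range(horizon // 2):
            state, _, done, _ = env.step(explorer.act(state))
            if done:
                break
        # Phase 2: hand control back to the evaluated policy and record
        # (state, action) pairs showing how it behaves in this region.
        while not done and len(trajectory) < horizon:
            action = policy.act(state)
            trajectory.append((state, action))
            state, _, done, _ = env.step(action)
        rollouts.append(trajectory)
    return rollouts
```

The split between a guidance phase and a recording phase is one simple way to realize the idea; the key design choice is that the displayed behavior always comes from the evaluated policy itself, not from the exploration mechanism.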