We investigate a deep reinforcement learning (RL) architecture that supports explaining why a learned agent prefers one action over another. The key idea is to learn action-values that are directly represented via human-understandable properties of expected futures. This is realized via the embedded self-prediction (ESP) model, which learns these properties in terms of human-provided features. Action preferences can then be explained by contrasting the future properties predicted for each action. To address cases with large numbers of features, we develop a novel method for computing minimal sufficient explanations from an ESP model. Our case studies in three domains, including a complex strategy game, show that ESP models can be effectively learned and support insightful explanations.
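To make the idea of contrastive explanation concrete, the following is a minimal illustrative sketch, not the paper's actual ESP architecture: it assumes hypothetical feature names, a stand-in feature predictor, and a linear combiner for turning predicted future features into an action-value.

```python
import numpy as np

# Illustrative human-provided future features (hypothetical names).
FEATURE_NAMES = ["damage_dealt", "units_lost", "territory_gained"]

def predicted_future_features(state, action):
    """Stand-in for a learned predictor of expected future feature values
    (one prediction head per human-provided feature)."""
    rng = np.random.default_rng(hash((state, action)) % 2**32)
    return rng.uniform(0.0, 1.0, size=len(FEATURE_NAMES))

def q_value(features, weights):
    """Combine predicted future features into a scalar action-value
    (a linear combiner is assumed here purely for illustration)."""
    return float(np.dot(weights, features))

def contrastive_explanation(state, preferred, alternative, weights):
    """Explain 'why preferred over alternative' by contrasting the future
    properties predicted for each action, ranked by contribution to the Q gap."""
    f_pref = predicted_future_features(state, preferred)
    f_alt = predicted_future_features(state, alternative)
    contribution = weights * (f_pref - f_alt)  # per-feature share of the difference
    return sorted(zip(FEATURE_NAMES, contribution), key=lambda kv: -abs(kv[1]))

weights = np.array([1.0, -1.5, 0.8])  # illustrative combiner weights
for name, delta in contrastive_explanation("s0", "attack", "retreat", weights):
    print(f"{name}: {delta:+.3f}")
```

A minimal sufficient explanation, in this spirit, would keep only the smallest subset of these per-feature contributions that still accounts for the preference; the ranking above simply shows the raw contrast.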