While deep reinforcement learning has proven successful at solving control tasks, the black-box nature of its agents has raised increasing concern. We propose ProtoX, a prototype-based post-hoc policy explainer that explains a black-box agent by prototyping the agent's behaviors into scenarios, each represented by a prototypical state. When learning prototypes, ProtoX considers both visual similarity and scenario similarity. The latter is unique to the reinforcement learning context, since it explains why the same action is taken in visually different states. To teach ProtoX about visual similarity, we pre-train an encoder with self-supervised contrastive learning to recognize states as similar if they occur close together in time and receive the same action from the black-box agent. We then add an isometry layer that lets ProtoX adapt scenario similarity to the downstream task. ProtoX is trained via imitation learning (behavior cloning) and therefore requires no access to the environment or the agent. Beyond explanation fidelity, we design several prototype-shaping terms in the objective function to encourage better interpretability. Experiments show that ProtoX achieves high fidelity to the original black-box agent while providing meaningful and understandable explanations.
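As a concrete illustration of the pipeline the abstract describes, the following is a minimal sketch of how a prototype-based policy explainer of this kind might be wired together. It assumes a PyTorch-style setup; all names here (ProtoPolicy, n_prototypes, bc_loss, the exp(-distance) similarity) are hypothetical stand-ins, not the paper's actual architecture, shaping terms, or training code.

```python
# Hypothetical sketch of a prototype-based post-hoc policy explainer.
# The encoder is assumed to be pre-trained with self-supervised contrastive
# learning (positive pairs: states close in time that receive the same action
# from the black-box agent) and then reused here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoPolicy(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int,
                 n_prototypes: int, n_actions: int):
        super().__init__()
        self.encoder = encoder  # contrastively pre-trained state encoder
        # Linear layer standing in for the isometry layer that adapts
        # scenario similarity to the downstream task.
        self.isometry = nn.Linear(embed_dim, embed_dim, bias=False)
        # Learnable prototypical states ("scenarios") in embedding space.
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, embed_dim))
        self.action_head = nn.Linear(n_prototypes, n_actions, bias=False)

    def forward(self, states):
        z = self.isometry(self.encoder(states))      # (B, D) embeddings
        dists = torch.cdist(z, self.prototypes)      # (B, P) distances
        sims = torch.exp(-dists)                     # closer prototype -> higher score
        # Action logits are a linear function of prototype similarities,
        # so each prediction decomposes into per-prototype evidence.
        return self.action_head(sims), sims

def bc_loss(model, states, black_box_actions):
    # Behavior cloning against the black-box agent's recorded actions:
    # training needs only logged (state, action) pairs, no environment access.
    logits, _ = model(states)
    return F.cross_entropy(logits, black_box_actions)
```

The per-prototype similarity scores returned alongside the logits are what make the policy explainable: for any state, the nearest prototypical scenarios indicate which learned behavior pattern drove the predicted action. The paper's prototype-shaping terms (omitted above) would be added to this imitation loss to encourage interpretable prototypes.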