While deep reinforcement learning has achieved promising results on challenging decision-making tasks, the backbone of its success, deep neural networks, is mostly a black box. A feasible way to gain insight into a black-box model is to distill it into an interpretable model such as a decision tree, which consists of if-then rules and is easy to understand and verify. However, traditional model distillation is usually a supervised learning task under a stationary data distribution assumption, which is violated in reinforcement learning. Consequently, a typical policy distillation that clones model behaviors, even with a small error, can introduce a data distribution shift, yielding an unsatisfactory distilled policy with low fidelity or poor performance. In this paper, we propose to address this issue by changing the distillation objective from behavior cloning to maximizing an advantage evaluation. The new objective maximizes an approximated cumulative reward and focuses more on disastrous behaviors in critical states, which mitigates the effect of the data shift. We evaluate our method on several Gym tasks, a commercial fighting game, and a self-driving car simulator. The empirical results show that the proposed method preserves a higher cumulative reward than behavior cloning and learns a policy more consistent with the original one. Moreover, by examining the rules extracted from the distilled decision trees, we demonstrate that the proposed method delivers reasonable and robust decisions.
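To make the contrast with behavior cloning concrete, the sketch below is one hypothetical way to realize an advantage-oriented distillation objective; it is not the paper's exact formulation. Assuming access to rollout states and the teacher's per-action Q-value estimates (`states`, `teacher_q` are illustrative names), it reduces distillation to cost-sensitive classification: each state is weighted by the teacher's advantage gap, so critical states, where a wrong action is disastrous, dominate the fit of the decision tree, whereas plain behavior cloning would weight all states equally.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def distill_with_advantage_weights(states, teacher_q, max_depth=6):
    """Illustrative advantage-weighted distillation (assumed setup, not the paper's exact method).

    states:    (N, d) array of observations gathered by rolling out the teacher policy.
    teacher_q: (N, A) array of the teacher's Q-value estimates for each action.

    Behavior cloning would fit the tree to argmax(teacher_q) with uniform weights.
    Here each state is instead weighted by its advantage gap (best Q minus worst Q),
    a simple proxy for emphasizing disastrous behaviors in critical states.
    """
    labels = teacher_q.argmax(axis=1)                        # teacher's greedy action
    adv_gap = teacher_q.max(axis=1) - teacher_q.min(axis=1)  # how costly a mistake can be here
    weights = adv_gap / (adv_gap.mean() + 1e-8)              # normalize weights for stability

    tree = DecisionTreeClassifier(max_depth=max_depth)
    tree.fit(states, labels, sample_weight=weights)          # cost-sensitive classification
    return tree

# Toy usage with random data, just to show the interface.
rng = np.random.default_rng(0)
demo_states = rng.normal(size=(1000, 4))
demo_q = rng.normal(size=(1000, 2))
tree = distill_with_advantage_weights(demo_states, demo_q)
```

In this reading, the weighting term is what distinguishes the objective from behavior cloning: states where all actions have nearly equal value contribute little, while states with a large advantage gap are the ones the distilled tree must get right.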