Knowledge distillation between machine learning models has opened many new avenues for reducing parameter counts, improving performance, and amortizing training time when the teacher and student networks use different architectures. In the case of reinforcement learning, this technique has also been applied to distill teacher policies to students. Until now, policy distillation has required access to a simulator or real-world trajectories. In this paper we introduce a simulator-free approach to knowledge distillation in the context of reinforcement learning. A key challenge is having the student learn the multiplicity of cases that correspond to a given action. While prior work has shown that data-free knowledge distillation is possible with supervised learning models by generating synthetic examples, these approaches are vulnerable to producing only a single prototype example per class. We propose an extension that explicitly handles multiple observations per output class, seeking as many exemplars as possible for each class by reinitializing the data generator and using an adversarial loss. To the best of our knowledge, this is the first demonstration of simulator-free knowledge distillation between a teacher and a student policy. This new approach improves over the state of the art on data-free learning of student networks on benchmark datasets (MNIST, Fashion-MNIST, CIFAR-10), and we also demonstrate that it specifically tackles issues with multiple input modes. We also identify open problems when distilling agents trained in high-dimensional environments such as Pong, Breakout, or Seaquest.
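To make the described training loop concrete, here is a minimal sketch of simulator-free distillation, assuming a PyTorch-style setup. The factory `make_generator`, the KL-divergence objective, the Adam optimizers, and the round/step counts are illustrative assumptions, not the paper's exact recipe; the two ideas it instantiates are from the abstract: an adversarial generator loss (the generator seeks inputs where student and teacher disagree) and periodic generator reinitialization to uncover additional exemplars per output class.

```python
import torch
import torch.nn.functional as F

def distill_simulator_free(teacher, student, make_generator, latent_dim=64,
                           rounds=5, steps=2000, batch_size=128, device="cpu"):
    """Sketch of data-free policy distillation (details are assumptions).

    The generator proposes synthetic observations; the student imitates the
    teacher on them. The generator is trained adversarially to maximize
    student/teacher disagreement and is reinitialized each round so that it
    can discover new input modes rather than one prototype per class.
    Assumes teacher and student are already on `device`.
    """
    teacher.eval()
    opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(rounds):
        # Reinitialize the generator to search for new exemplars per class.
        gen = make_generator(latent_dim).to(device)
        opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
        for _ in range(steps):
            z = torch.randn(batch_size, latent_dim, device=device)

            # Generator step: *maximize* student/teacher disagreement,
            # pushing synthetic inputs toward regions the student gets wrong.
            x = gen(z)
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            disagreement = F.kl_div(F.log_softmax(s_logits, dim=1),
                                    F.softmax(t_logits, dim=1),
                                    reduction="batchmean")
            opt_g.zero_grad()
            (-disagreement).backward()
            opt_g.step()

            # Student step: minimize disagreement on fresh samples from
            # the just-updated generator (detached so only the student trains).
            x = gen(z).detach()
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            loss = F.kl_div(F.log_softmax(s_logits, dim=1),
                            F.softmax(t_logits, dim=1),
                            reduction="batchmean")
            opt_s.zero_grad()
            loss.backward()
            opt_s.step()
    return student
```

For a policy-distillation setting, `teacher` and `student` would map observations to action logits; for the supervised benchmarks (MNIST, Fashion-MNIST, CIFAR-10) they are ordinary classifiers. The alternating min-max structure is what the abstract's "adversarial loss" refers to; the outer loop of generator restarts is the proposed mechanism for covering multiple input modes.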