Knowledge distillation (KD) is essentially a process of transferring a teacher model's behavior, e.g., its network response, to a student model. The network response serves as additional supervision to formulate the machine domain, which uses the data collected from the human domain as a transfer set. Traditional KD methods hold an underlying assumption that the data collected in the human domain and the machine domain are independent and identically distributed (IID). We point out that this naive assumption is unrealistic and that there is indeed a transfer gap between the two domains. Although the gap offers the student model external knowledge from the machine domain, the imbalanced teacher knowledge leads to an incorrect estimate of how much to transfer from teacher to student per sample on the non-IID transfer set. To tackle this challenge, we propose Inverse Probability Weighting Distillation (IPWD), which estimates the propensity score of a training sample belonging to the machine domain and assigns its inverse as a weight to compensate for under-represented samples. Experiments on CIFAR-100 and ImageNet demonstrate the effectiveness of IPWD for both two-stage distillation and one-stage self-distillation.
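To make the weighting idea concrete, here is a minimal sketch (not the authors' code) of an inverse-propensity-weighted distillation loss. It assumes a propensity estimate `propensity` in (0, 1] for each sample, i.e., how likely the sample is to be well represented in the machine domain; the function name, the temperature `T`, and the normalization choice are our own illustrative assumptions.

```python
# Hedged sketch of inverse-probability-weighted KD, assuming a per-sample
# propensity estimate is available; names and defaults are illustrative.
import torch
import torch.nn.functional as F

def ipw_distillation_loss(student_logits, teacher_logits, propensity, T=4.0):
    """Per-sample KL distillation loss, re-weighted by inverse propensity."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    # KL(p_t || p_s) per sample, scaled by T^2 as in standard KD
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1) * (T ** 2)
    # Inverse probability weights: under-represented samples (low propensity)
    # receive larger weights; clamp avoids division by zero.
    weights = 1.0 / propensity.clamp(min=1e-6)
    weights = weights / weights.mean()  # normalize weights to mean 1
    return (weights * kl).mean()
```

Under this reading, samples the teacher under-represents contribute more to the distillation term, which is the compensation effect the abstract describes.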