Distillation with unlabeled examples is a popular and powerful method for training deep neural networks in settings where the amount of labeled data is limited: a large "teacher" neural network is trained on the available labeled data and then used to generate labels on an unlabeled dataset (typically much larger). These labels are then used to train the smaller "student" model that will actually be deployed. Naturally, the success of the approach depends on the quality of the teacher's labels, since the student could be confused if trained on inaccurate data. This paper proposes a principled approach for addressing this issue based on a "debiasing" reweighting of the student's loss function tailored to the distillation training paradigm. Our method is hyper-parameter free, data-agnostic, and simple to implement. We demonstrate significant improvements on popular academic datasets, and we accompany our results with a theoretical analysis that rigorously justifies the performance of our method in certain settings.
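To make the setup concrete, the following is a minimal, hypothetical PyTorch sketch of distillation with a per-example reweighted student loss: the teacher pseudo-labels an unlabeled pool and the student minimizes a weighted cross-entropy against those labels. The models, data, and in particular the confidence-based weighting rule are illustrative placeholders, not the debiasing scheme proposed in the paper.

```python
# Hypothetical sketch: distillation on unlabeled data with a per-example
# reweighted student loss. The weighting rule below (teacher confidence)
# is only a placeholder for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, feat_dim = 10, 32

teacher = nn.Linear(feat_dim, num_classes)   # stand-in for a large trained teacher
student = nn.Linear(feat_dim, num_classes)   # smaller model to be deployed

# Unlabeled pool: the teacher produces soft pseudo-labels.
x_unlabeled = torch.randn(512, feat_dim)
with torch.no_grad():
    teacher_probs = F.softmax(teacher(x_unlabeled), dim=1)

# Illustrative per-example weights: down-weight low-confidence teacher labels.
# (The paper's method instead derives "debiasing" weights for this loss.)
weights = teacher_probs.max(dim=1).values    # teacher confidence per example
weights = weights / weights.mean()           # normalize to mean 1

optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
for _ in range(5):
    logits = student(x_unlabeled)
    # Per-example cross-entropy against the teacher's soft labels.
    per_example_loss = -(teacher_probs * F.log_softmax(logits, dim=1)).sum(dim=1)
    loss = (weights * per_example_loss).mean()  # reweighted student objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key design point is that the weights enter only the student's objective, so any reweighting scheme can be swapped in without changing the teacher or the rest of the training loop.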