Knowledge distillation uses both real hard labels and soft labels predicted by teacher models as supervision. Intuitively, we expect the soft labels and hard labels to be concordant with respect to the order of their probabilities. However, we found critical order violations between hard labels and soft labels in augmented samples. For example, for an augmented sample $x=0.7*panda+0.3*cat$, we expect the order of meaningful soft labels to be $P_\text{soft}(panda|x)>P_\text{soft}(cat|x)>P_\text{soft}(other|x)$. In practice, the teacher's soft labels often violate this order, e.g. $P_\text{soft}(tiger|x)>P_\text{soft}(panda|x)>P_\text{soft}(cat|x)$. We attribute this to the limited generalization ability of the teacher, which leads to prediction errors on augmented samples. Empirically, we found that these violations are common and harm knowledge transfer. In this paper, we introduce order restrictions to data augmentation for knowledge distillation, an approach we call isotonic data augmentation (IDA). We use isotonic regression (IR), a classic technique from statistics, to eliminate the order violations. We show that IDA can be modeled as a tree-structured IR problem. We therefore adapt the classical IRT-BIN algorithm to obtain optimal solutions with $O(c \log c)$ time complexity, where $c$ is the number of labels. To further reduce the time complexity, we also propose a GPU-friendly approximation with linear time complexity. We have verified on various datasets and data augmentation techniques that our proposed IDA algorithms effectively increase the accuracy of knowledge distillation by eliminating the order violations.
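To make the order restriction concrete, the following is a minimal sketch of a projection for the panda/cat example. It is not the paper's IRT-BIN adaptation or its GPU-friendly approximation. It assumes a squared-error objective and the tree order $P(major) \ge P(minor) \ge P(other)$ for a two-image mix, and the names `isotonic_soft_labels`, `_pool`, `major`, and `minor` are hypothetical.

```python
from typing import List


def _pool(values: List[float], seed: List[int], rest: List[int]) -> float:
    """Pool the `seed` entries with the largest `rest` entries until no
    remaining entry exceeds the pooled mean; return that mean."""
    total = sum(values[i] for i in seed)
    count = len(seed)
    # Visit the remaining classes from largest to smallest probability.
    for k in sorted(rest, key=lambda i: values[i], reverse=True):
        if values[k] <= total / count:
            break  # every later entry is smaller, so none violates the order
        total += values[k]
        count += 1
    return total / count


def isotonic_soft_labels(p: List[float], major: int, minor: int) -> List[float]:
    """Least-squares projection of soft labels `p` onto the assumed order
    p[major] >= p[minor] >= p[k] for every other class k."""
    others = [i for i in range(len(p)) if i not in (major, minor)]
    # Step 1: enforce minor >= others while ignoring the major class.
    t = _pool(p, [minor], others)
    if p[major] >= t:
        top = p[major]  # the major class already dominates; keep it
    else:
        # Step 2: major >= minor is violated, so merge the two and re-pool.
        t = _pool(p, [major, minor], others)
        top = t
    out = [min(v, t) for v in p]  # clip any other class that exceeds t
    out[major] = top
    out[minor] = t
    return out


# Hypothetical teacher output for x = 0.7*panda + 0.3*cat,
# with classes ordered as [panda, cat, tiger, dog].
probs = [0.30, 0.20, 0.45, 0.05]
fixed = isotonic_soft_labels(probs, major=0, minor=1)
```

In this example the excess probability of tiger is shared with panda and cat, so the three classes end at roughly 0.317 each while dog stays at 0.05. Because pooling replaces values by their mean, the total probability is unchanged. The sort dominates the cost, so the sketch runs in $O(c \log c)$ for $c$ labels.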