Compared with feature-based distillation methods, logits distillation relaxes the requirement of consistent feature dimensions between the teacher and student networks, but its performance is generally considered inferior in face recognition. One major challenge is that the light-weight student network has difficulty fitting the target logits due to its low model capacity and the large number of identities in face recognition. Therefore, we seek to probe the target logits to extract the primary knowledge related to face identity and discard the rest, making distillation more tractable for the student network. Specifically, the prediction contains a tail group of near-zero values that carries only minor knowledge for distillation. To analyze its impact clearly, we first partition the logits into two groups, i.e., a Primary Group and a Secondary Group, according to the cumulative probability of the softened prediction. Then, we reorganize the Knowledge Distillation (KD) loss over the grouped logits into three parts, i.e., Primary-KD, Secondary-KD, and Binary-KD. Primary-KD distills the primary knowledge from the teacher, Secondary-KD aims to refine the minor knowledge but increases the difficulty of distillation, and Binary-KD ensures the consistency of the knowledge distribution between the teacher and student. We experimentally find that (1) Primary-KD and Binary-KD are indispensable for KD, and (2) Secondary-KD is the culprit that restricts KD performance. Therefore, we propose Grouped Knowledge Distillation (GKD), which retains Primary-KD and Binary-KD but omits Secondary-KD in the final KD loss. Extensive experimental results on popular face recognition benchmarks demonstrate the superiority of the proposed GKD over state-of-the-art methods.
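To make the loss decomposition concrete, below is a minimal sketch of a grouped KD loss in PyTorch, assuming the standard KL-divergence decomposition over a teacher-defined primary/secondary partition. The function name `grouped_kd_loss`, the threshold `cum_thresh`, and the exact grouping rule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def grouped_kd_loss(student_logits, teacher_logits, T=4.0, cum_thresh=0.9):
    # Softened predictions of teacher and student.
    p_t = F.softmax(teacher_logits / T, dim=1)
    p_s = F.softmax(student_logits / T, dim=1)

    # Partition classes per sample: the smallest set of top-ranked teacher
    # classes whose cumulative probability reaches cum_thresh forms the
    # Primary Group; the near-zero tail forms the Secondary Group.
    sorted_p, idx = torch.sort(p_t, dim=1, descending=True)
    cum = torch.cumsum(sorted_p, dim=1)
    in_primary_sorted = (cum - sorted_p) < cum_thresh
    primary_mask = torch.zeros_like(p_t).scatter(
        1, idx, in_primary_sorted.float()).bool()

    # Binary (group-level) probabilities: mass in the primary vs. secondary group.
    b_t = torch.stack([(p_t * primary_mask).sum(1),
                       (p_t * ~primary_mask).sum(1)], dim=1)
    b_s = torch.stack([(p_s * primary_mask).sum(1),
                       (p_s * ~primary_mask).sum(1)], dim=1)

    eps = 1e-8
    # Primary-KD: KL divergence within the primary group, with the
    # within-group distributions renormalized by the group mass and
    # the term weighted by the teacher's primary-group mass.
    p_t_prim = (p_t * primary_mask) / (b_t[:, :1] + eps)
    p_s_prim = (p_s * primary_mask) / (b_s[:, :1] + eps)
    primary_kd = (p_t_prim * (torch.log(p_t_prim + eps)
                              - torch.log(p_s_prim + eps))).sum(1)
    primary_kd = (b_t[:, 0] * primary_kd).mean()

    # Binary-KD: KL divergence between the two-way group distributions.
    binary_kd = (b_t * (torch.log(b_t + eps)
                        - torch.log(b_s + eps))).sum(1).mean()

    # Secondary-KD (the within-group term over the near-zero tail) is
    # deliberately omitted, following the GKD formulation.
    return (primary_kd + binary_kd) * (T ** 2)
```

Under this decomposition, the full KD loss would also include the teacher-weighted KL term over the Secondary Group; dropping that term while keeping the primary and binary terms reflects the grouping described above.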