While various knowledge distillation (KD) methods for CNN-based detectors have shown their effectiveness in improving small students, the baselines and recipes for DETR-based detectors are yet to be built. In this paper, we focus on the transformer decoder of DETR-based detectors and explore KD methods for it. The outputs of the transformer decoder lie in random order, which gives no direct correspondence between the predictions of the teacher and the student, thus posing a challenge for knowledge distillation. To this end, we propose MixMatcher, which aligns the decoder outputs of DETR-based teachers and students by mixing two teacher-student matching strategies, i.e., Adaptive Matching and Fixed Matching. Specifically, Adaptive Matching applies bipartite matching to adaptively match the outputs of the teacher and the student in each decoder layer, while Fixed Matching fixes the correspondence between the outputs of the teacher and the student that share the same object queries, with the teacher's fixed object queries fed to the decoder of the student as an auxiliary group. Based on MixMatcher, we build \textbf{D}ecoder \textbf{D}istillation for \textbf{DE}tection \textbf{TR}ansformer (D$^3$ETR), which distills knowledge in both decoder predictions and attention maps from the teacher to the student. D$^3$ETR shows superior performance on various DETR-based detectors with different backbones. For example, with Conditional DETR-R101-C5 as the teacher, D$^3$ETR improves Conditional DETR-R50-C5 by $\textbf{7.8}/\textbf{2.4}$ mAP under the $12$/$50$-epoch training settings.
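To make the two matching strategies concrete, the sketch below shows how they could be realized in a standard DETR-style setup with $N$ object queries per decoder layer. This is a minimal illustration in plain PyTorch + SciPy; the function names, the particular cost terms, and the box-cost weight are assumptions for exposition, not the paper's released implementation.

```python
# Illustrative sketch of MixMatcher's two matching strategies (assumed names
# and cost design; not the authors' released code).
import torch
from scipy.optimize import linear_sum_assignment


def adaptive_matching(t_logits, s_logits, t_boxes, s_boxes):
    """Adaptive Matching: bipartite matching between the teacher's and the
    student's predictions of one decoder layer.

    t_logits, s_logits: [N, num_classes] classification logits
    t_boxes,  s_boxes:  [N, 4] normalized (cx, cy, w, h) boxes
    Returns index arrays (s_idx, t_idx) pairing student prediction s_idx[k]
    with teacher prediction t_idx[k].
    """
    # Classification cost: negative affinity between class distributions.
    cls_cost = -(s_logits.softmax(-1) @ t_logits.softmax(-1).T)  # [N, N]
    # Box cost: pairwise L1 distance between predicted boxes.
    box_cost = torch.cdist(s_boxes, t_boxes, p=1)                # [N, N]
    # The 5.0 weight mirrors common DETR matching recipes (an assumption here).
    cost = (cls_cost + 5.0 * box_cost).detach().cpu().numpy()
    s_idx, t_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return s_idx, t_idx


def fixed_matching(num_queries):
    """Fixed Matching: the teacher's object queries are fed to the student's
    decoder as an auxiliary query group, so the i-th prediction of that group
    corresponds to the i-th teacher prediction by construction."""
    idx = list(range(num_queries))
    return idx, idx


if __name__ == "__main__":
    N, C = 300, 91  # e.g., a COCO-style configuration
    s_idx, t_idx = adaptive_matching(
        torch.randn(N, C), torch.randn(N, C),
        torch.rand(N, 4), torch.rand(N, 4),
    )
    print(len(s_idx), "teacher-student pairs from Adaptive Matching")
```

Once either strategy yields aligned prediction pairs, the distillation losses on decoder predictions and attention maps can be computed over those pairs; mixing the two strategies gives the student both a stable, query-tied correspondence and a flexible, per-layer one.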