We present the first pixel-level self-supervised distillation framework designed for dense prediction tasks. Our approach, called Pixel-Wise Contrastive Distillation (PCD), distills knowledge by attracting the corresponding pixels of the student's and teacher's output feature maps. This pixel-to-pixel distillation requires preserving the spatial information of the teacher's output. We therefore propose a SpatialAdaptor that adapts the teacher's well-trained projection/prediction head, originally used to encode vectorized features, to process 2D feature maps. The SpatialAdaptor enables more informative pixel-level distillation, yielding a better student for dense prediction tasks. In addition, in light of the limited effective receptive fields of small models, we employ a plug-in multi-head self-attention module to explicitly relate the pixels of the student's feature maps. Overall, PCD outperforms previous self-supervised distillation methods on various dense prediction tasks. A ResNet-18 backbone distilled by PCD achieves $37.4$ AP$^\text{bbox}$ and $34.0$ AP$^\text{mask}$ with a Mask R-CNN detector on the COCO dataset, emerging as the first pre-training method that surpasses its supervised pre-trained counterpart.
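To make the two mechanisms concrete, the following PyTorch sketch illustrates one plausible reading of them. It is not the authors' code: `spatial_adaptor` and `pcd_loss` are hypothetical names, the head conversion assumes a Linear/BN1d projection head, and the loss is written as a standard InfoNCE over co-located pixels; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_adaptor(mlp_head: nn.Sequential) -> nn.Sequential:
    """Illustrative SpatialAdaptor: convert a projection head trained on
    vectorized features (Linear + BN1d) into an equivalent 2D head
    (1x1 Conv + BN2d) so it can process feature maps while keeping
    spatial information. This weight reshaping is an assumption."""
    layers = []
    for m in mlp_head:
        if isinstance(m, nn.Linear):
            conv = nn.Conv2d(m.in_features, m.out_features, kernel_size=1,
                             bias=m.bias is not None)
            # A Linear weight (out, in) maps directly to a 1x1 conv kernel.
            conv.weight.data.copy_(m.weight.data.view(*m.weight.shape, 1, 1))
            if m.bias is not None:
                conv.bias.data.copy_(m.bias.data)
            layers.append(conv)
        elif isinstance(m, nn.BatchNorm1d):
            # BN1d and BN2d share parameter shapes, so state transfers as-is.
            bn = nn.BatchNorm2d(m.num_features)
            bn.load_state_dict(m.state_dict())
            layers.append(bn)
        else:
            layers.append(m)  # e.g. ReLU acts pointwise in either layout
    return nn.Sequential(*layers)

def pcd_loss(student_map, teacher_map, tau=0.2):
    """Pixel-wise contrastive loss sketch: each student pixel is attracted
    to the teacher pixel at the same location (positive); all other
    teacher pixels in the batch serve as negatives."""
    B, C, H, W = student_map.shape
    s = F.normalize(student_map.permute(0, 2, 3, 1).reshape(-1, C), dim=1)
    t = F.normalize(teacher_map.detach()  # teacher gets no gradient
                    .permute(0, 2, 3, 1).reshape(-1, C), dim=1)
    logits = s @ t.t() / tau                            # (BHW, BHW)
    targets = torch.arange(s.size(0), device=s.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)
```

Under this reading, the 1x1-conv conversion leaves the head's per-pixel computation numerically identical to applying the original Linear layers to each spatial position, which is what allows the well-trained vector head to be reused for dense, pixel-level outputs.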