Knowledge distillation (KD) has proven to be a simple and effective tool for training compact models. Almost all KD variants for semantic segmentation align the student and teacher networks' feature maps in the spatial domain, typically by minimizing point-wise and/or pair-wise discrepancies. Observing that in semantic segmentation the per-channel feature activations of some layers tend to encode the saliency of scene categories (analogous to class activation maps), we propose to align features channel-wise between the student and teacher networks. To this end, we first transform the feature map of each channel into a probability distribution using softmax normalization, and then minimize the Kullback-Leibler (KL) divergence between the corresponding channels of the two networks. By doing so, our method focuses on mimicking the soft channel-wise distributions between the networks. In particular, the KL divergence drives the student to pay more attention to the most salient regions of the channel-wise maps, which presumably correspond to the most useful signals for semantic segmentation. Experiments demonstrate that our channel-wise distillation considerably outperforms almost all existing spatial distillation methods for semantic segmentation, while requiring less computational cost during training. We consistently achieve superior performance on three benchmarks with various network structures. Code is available at: https://git.io/ChannelDis
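To make the channel-wise objective concrete, the following is a minimal PyTorch-style sketch (not the released implementation). It assumes the student and teacher feature maps already have matching shapes (N, C, H, W); the temperature `tau` and the function name `channel_wise_distillation_loss` are illustrative choices. Each channel is flattened over its spatial locations, softmax-normalized into a distribution, and the KL divergence from the teacher's channel distribution to the student's is minimized.

```python
import torch
import torch.nn.functional as F


def channel_wise_distillation_loss(feat_s: torch.Tensor,
                                   feat_t: torch.Tensor,
                                   tau: float = 4.0) -> torch.Tensor:
    """Channel-wise distillation sketch.

    feat_s, feat_t: student / teacher feature maps of shape (N, C, H, W),
    assumed to have the same shape (in practice a 1x1 conv may be needed
    to match channel dimensions).
    """
    n, c, h, w = feat_t.shape
    # Flatten spatial dimensions: (N, C, H*W).
    s = feat_s.view(n, c, -1)
    t = feat_t.view(n, c, -1)
    # Softmax over spatial locations turns each channel map into a distribution.
    log_p_s = F.log_softmax(s / tau, dim=-1)
    p_t = F.softmax(t / tau, dim=-1)
    # KL(teacher || student), summed over spatial positions and averaged over
    # channels and batch; tau**2 compensates for the softened gradients.
    loss = F.kl_div(log_p_s, p_t, reduction='sum') * (tau ** 2) / (n * c)
    return loss


# Usage example with random features standing in for network outputs.
if __name__ == "__main__":
    student_feat = torch.randn(2, 19, 64, 64, requires_grad=True)
    teacher_feat = torch.randn(2, 19, 64, 64)
    print(channel_wise_distillation_loss(student_feat, teacher_feat))
```

Because the softmax concentrates probability mass on the most activated spatial positions of each channel, the KL term naturally emphasizes the salient regions described above.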