We leverage probabilistic models of neural representations to investigate how residual networks fit classes. To this end, we estimate class-conditional density models for representations learned by deep ResNets. We then use these models to characterize the distributions of representations across learned classes. Surprisingly, we find that classes in the investigated models are not fitted in a uniform way. On the contrary, we uncover two groups of classes that are fitted with markedly different distributions of representations. These distinct modes of class-fitting are evident only in the deeper layers of the investigated models, indicating that they are not related to low-level image features. We show that the uncovered structure in neural representations correlates with memorization of training examples and adversarial robustness. Finally, we compare the class-conditional distributions of neural representations between memorized and typical examples. This allows us to uncover where in the network structure class labels arise for memorized and standard inputs.
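To make the first step concrete, the sketch below fits one Gaussian density per class to a matrix of layer representations and scores inputs under each class-conditional model. This is a minimal sketch under assumptions, not the paper's actual procedure: the abstract does not specify the density model family, so the Gaussian choice, the regularizer, and the function names (`fit_class_conditional_gaussians`, `class_log_likelihoods`) are illustrative.

```python
# Minimal sketch (assumed setup, not the paper's exact method): fit one
# full-covariance Gaussian per class to layer representations, then evaluate
# the log-density of representations under every class-conditional model.
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_conditional_gaussians(features, labels, reg=1e-3):
    """Fit a Gaussian to the representations of each class.

    features: (N, D) array of layer representations (e.g., penultimate layer)
    labels:   (N,) integer class labels
    reg:      diagonal regularizer keeping covariances well-conditioned
    """
    models = {}
    for c in np.unique(labels):
        x = features[labels == c]
        mean = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + reg * np.eye(x.shape[1])
        models[c] = multivariate_normal(mean=mean, cov=cov)
    return models

def class_log_likelihoods(models, features):
    """Return an (N, C) array of log-densities under each class model."""
    return np.stack([models[c].logpdf(features) for c in sorted(models)], axis=1)

# Toy usage with synthetic "representations" in place of real ResNet features:
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))
labs = rng.integers(0, 10, size=1000)
models = fit_class_conditional_gaussians(feats, labs)
scores = class_log_likelihoods(models, feats)  # shape (1000, 10)
```

With such per-class models in hand, comparing the resulting distributions across classes and across layers is what would reveal whether classes are fitted uniformly or fall into distinct groups, as the abstract describes.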