Minimizing cross-entropy over the softmax scores of a linear map composed with a high-capacity encoder is arguably the most popular choice for training neural networks on supervised learning tasks. However, recent works show that one can instead optimize the encoder directly, via a supervised variant of a contrastive objective, to obtain equally (or even more) discriminative representations. In this work, we address the question of whether there are fundamental differences in the representation geometry that the two losses seek in the output space of the encoder at minimal loss. Specifically, we prove, under mild assumptions, that both losses attain their minimum once the representations of each class collapse to the vertices of a regular simplex inscribed in a hypersphere. We provide empirical evidence that this configuration is attained in practice and that reaching a close-to-optimal state typically indicates good generalization performance. Yet, the two losses show remarkably different optimization behavior. For the supervised contrastive loss, the number of iterations required to fit the data perfectly scales superlinearly with the number of randomly flipped labels. This is in contrast to the approximately linear scaling previously reported for networks trained with cross-entropy.
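To make the optimal geometry concrete, the following minimal sketch builds the vertices of a regular simplex inscribed in a hypersphere and checks its defining properties: all vertices have the same norm, they sum to zero, and every pair has the same cosine similarity, -1/(K-1) for K classes. The construction (centering the standard basis vectors of R^K and rescaling) and the names `simplex_vertices`, `dot`, and `cosine` are illustrative assumptions, not taken from the paper; the resulting points lie in a (K-1)-dimensional subspace of R^K.

```python
import math


def simplex_vertices(num_classes, radius=1.0):
    """Return K points forming a regular simplex on a sphere of the given radius.

    Illustrative construction (an assumption, not the paper's): take the
    standard basis vectors e_1, ..., e_K of R^K, subtract their mean so the
    points are centered at the origin, then rescale to the requested norm.
    """
    if num_classes < 2:
        raise ValueError("a simplex needs at least two vertices")
    k = num_classes
    # The centered vector e_i - (1/K) * ones has norm sqrt((K - 1) / K).
    scale = radius / math.sqrt((k - 1) / k)
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


if __name__ == "__main__":
    K = 4  # hypothetical number of classes
    vertices = simplex_vertices(K)
    norms = [math.sqrt(dot(v, v)) for v in vertices]
    cosines = [
        cosine(vertices[i], vertices[j])
        for i in range(K)
        for j in range(i + 1, K)
    ]
    print("vertex norms:", [round(n, 6) for n in norms])
    print("pairwise cosines:", [round(c, 6) for c in cosines])
    print("expected cosine -1/(K-1):", round(-1.0 / (K - 1), 6))
```

In this picture, each vertex stands for the point to which all representations of one class collapse; the equal pairwise angles mean that no pair of classes is closer together than any other.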