In this work, we explore the maximum-margin bias of quasi-homogeneous neural networks trained with gradient flow on an exponential loss and past a point of separability. We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations, even those with biases, residual connections, and normalization layers, while structured enough to enable geometric analysis of its gradient dynamics. Using this analysis, we generalize the existing results of maximum-margin bias for homogeneous networks to this richer class of models. We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally. We demonstrate through simple examples how this strong favoritism toward minimizing an asymmetric norm can degrade the robustness of quasi-homogeneous models. On the other hand, we conjecture that this norm-minimization discards, when possible, unnecessary higher-order parameters, reducing the model to a sparser parameterization. Lastly, by applying our theorem to sufficiently expressive neural networks with normalization layers, we reveal a universal mechanism behind the empirical phenomenon of Neural Collapse.
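For orientation, recall that a model $f(\theta; x)$ is homogeneous of order $L$ when $f(\alpha\theta; x) = \alpha^{L} f(\theta; x)$ for all $\alpha > 0$. The quasi-homogeneous generalization, sketched below with generic exponents $a_1, \dots, a_n$ and degree $d$ (illustrative symbols, not necessarily the paper's notation), instead assigns each parameter its own scaling exponent:
$$ f\big(\alpha^{a_1}\theta_1, \dots, \alpha^{a_n}\theta_n;\, x\big) \;=\; \alpha^{d}\, f(\theta; x), \qquad \forall \alpha > 0. $$
Allowing unequal exponents is what lets the class accommodate biases, residual connections, and normalization parameters alongside ordinary weights, and it is this asymmetry in scaling that underlies the asymmetric norm favored by gradient flow.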