We provide the first global optimization landscape analysis of $Neural\;Collapse$ -- an intriguing empirical phenomenon that arises in the last-layer classifiers and features of neural networks during the terminal phase of training. As recently reported by Papyan et al., this phenomenon implies that ($i$) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and ($ii$) cross-example within-class variability of last-layer activations collapses to zero. We study the problem based on a simplified $unconstrained\;feature\;model$, which isolates the topmost layers from the classifier of the neural network. In this context, we show that the classical cross-entropy loss with weight decay has a benign global landscape, in the sense that the only global minimizers are the Simplex ETFs, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions. In contrast to existing landscape analyses for deep neural networks, which are often disconnected from practice, our analysis of the simplified model not only explains what kind of features are learned in the last layer, but also shows why they can be efficiently optimized in the simplified settings, matching the empirical observations in practical deep network architectures. These findings could have profound implications for optimization, generalization, and robustness, and are of broad interest. For example, our experiments demonstrate that one may set the feature dimension equal to the number of classes and fix the last-layer classifier to be a Simplex ETF for network training, which reduces memory cost by over $20\%$ on ResNet18 without sacrificing generalization performance.
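For concreteness, a minimal recap of the standard Simplex ETF construction referenced above (the scaling factor is the "up to scaling" degree of freedom; the final identity, fixing the last-layer weights to the ETF with feature dimension $d = K$, is our reading of the frozen-classifier experiment, not an additional claim):
\[
\mathbf{M} \;=\; \sqrt{\tfrac{K}{K-1}}\,\Big(\mathbf{I}_K - \tfrac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top\Big) \,\in\, \mathbb{R}^{K\times K},
\qquad
\mathbf{m}_j^\top \mathbf{m}_k \;=\;
\begin{cases}
1, & j = k,\\[2pt]
-\tfrac{1}{K-1}, & j \neq k,
\end{cases}
\]
where $K$ is the number of classes and the columns $\mathbf{m}_1,\dots,\mathbf{m}_K$ of $\mathbf{M}$ are the ETF vertices: unit-norm, equiangular, and maximally separated in the sense that $-\tfrac{1}{K-1}$ is the smallest possible common inner product among $K$ unit vectors. Fixing the classifier then amounts to setting $\mathbf{W} = \alpha\,\mathbf{M}^\top$ for some scale $\alpha > 0$ and training only the preceding layers.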