The inductive bias of a neural network is largely determined by the architecture and the training algorithm. How to train a neural network effectively is therefore of great importance for achieving good generalization. We propose a novel orthogonal over-parameterized training (OPT) framework that can provably minimize the hyperspherical energy, which characterizes the diversity of neurons on a hypersphere. By maintaining minimum hyperspherical energy during training, OPT can greatly improve empirical generalization. Specifically, OPT fixes the randomly initialized weights of the neurons and learns an orthogonal transformation that is applied to these neurons. We consider multiple ways to learn such an orthogonal transformation, including unrolling orthogonalization algorithms, applying orthogonal parameterization, and designing orthogonality-preserving gradient descent. For better scalability, we propose stochastic OPT, which performs the orthogonal transformation stochastically on a subset of neuron dimensions. Interestingly, OPT reveals that learning a proper coordinate system for neurons is crucial to generalization. We provide insights into why OPT yields better generalization. Extensive experiments validate the superiority of OPT over standard training.
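For concreteness, a commonly used form of the hyperspherical energy mentioned above is sketched below; the exponent $s$ and the exact normalization are assumptions about the paper's precise definition. For $N$ neurons $w_1, \dots, w_N \in \mathbb{R}^d$ projected onto the unit hypersphere as $\hat{w}_i = w_i / \|w_i\|$,

\[
E_s(\hat{w}_1, \dots, \hat{w}_N) = \sum_{i=1}^{N} \sum_{j \neq i} \|\hat{w}_i - \hat{w}_j\|^{-s}, \qquad s > 0,
\]

so lower energy corresponds to neurons that are more uniformly spread over the hypersphere.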
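The following is a minimal sketch, in PyTorch, of the idea of fixing randomly initialized neurons and learning only an orthogonal transformation. The class name `OPTLinear` and the use of a Cayley parameterization are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class OPTLinear(nn.Module):
    """Linear layer whose randomly initialized neurons stay fixed;
    only a shared orthogonal transform R of the input space is learned."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Fixed, randomly initialized neuron weights (registered as a
        # buffer, so they are never updated by the optimizer).
        w = torch.randn(out_features, in_features) / in_features ** 0.5
        self.register_buffer("weight", w)
        # Learnable parameter A; R is built from the skew-symmetric part
        # of A via the Cayley transform, so R is orthogonal by construction.
        self.A = nn.Parameter(torch.zeros(in_features, in_features))

    def orthogonal_matrix(self) -> torch.Tensor:
        skew = self.A - self.A.t()  # skew-symmetric matrix
        eye = torch.eye(self.A.size(0), device=self.A.device)
        # Cayley transform: R = (I - S)(I + S)^{-1} is orthogonal.
        return (eye - skew) @ torch.linalg.inv(eye + skew)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        R = self.orthogonal_matrix()
        # Equivalent to applying the fixed neurons after rotating them by R.
        return nn.functional.linear(x, self.weight @ R)


if __name__ == "__main__":
    layer = OPTLinear(16, 8)
    out = layer(torch.randn(4, 16))
    print(out.shape)  # torch.Size([4, 8])
```

Because the transform is orthogonal, the pairwise angles (and hence the hyperspherical energy) of the randomly initialized neurons are preserved throughout training, which is the property the abstract attributes to OPT; the Cayley parameterization shown here is only one of the options the paper lists for enforcing orthogonality.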