The problem of optimization on the Stiefel manifold, i.e., minimizing functions of (not necessarily square) matrices that satisfy orthogonality constraints, has been extensively studied. Yet, a new approach is proposed based on, for the first time, an interplay between thoughtfully designed continuous and discrete dynamics. It leads to a gradient-based optimizer with intrinsically added momentum. This method exactly preserves the manifold structure but does not require additional operations to keep the momentum in the changing (co)tangent space, and thus has low computational cost and good accuracy. Its generalization to adaptive learning rates is also demonstrated. Notable performance is observed in practical tasks. For instance, we found that placing orthogonality constraints on the attention heads of a trained-from-scratch Vision Transformer [Dosovitskiy et al. 2022] could markedly improve its performance when our optimizer is used, and that it is better to make each head orthogonal within itself but not necessarily to the other heads. This optimizer also makes the useful notion of Projection Robust Wasserstein Distance [Paty & Cuturi 2019; Lin et al. 2020] for high-dimensional optimal transport even more effective.
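To make the setting concrete, the following is a minimal sketch of a *generic baseline* for momentum optimization on the Stiefel manifold, not the optimizer proposed above: the Euclidean gradient is projected onto the tangent space, the momentum buffer is re-projected at every step (exactly the extra bookkeeping the proposed method avoids), and the iterate is retracted back to the manifold via a QR decomposition. All names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def stiefel_momentum_step(X, M, grad, lr=0.1, beta=0.9):
    """One projected-momentum step on the Stiefel manifold {X : X^T X = I}.

    A generic baseline sketch (not the proposed optimizer): the momentum M
    must be re-projected into the tangent space at the current X, and the
    update is retracted to the manifold via QR.
    """
    def project_tangent(X, G):
        # Tangent space at X: G - X * sym(X^T G)
        XtG = X.T @ G
        return G - X @ ((XtG + XtG.T) / 2)

    M = beta * project_tangent(X, M) + project_tangent(X, grad)
    Q, R = np.linalg.qr(X - lr * M)
    # Fix column signs so the QR retraction is uniquely defined
    Q = Q * np.sign(np.sign(np.diag(R)) + 0.5)
    return Q, M

# Toy example: minimize f(X) = -trace(A^T X) over 5x2 orthonormal frames
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
X, _ = np.linalg.qr(rng.standard_normal((5, 2)))
M = np.zeros_like(X)
for _ in range(200):
    X, M = stiefel_momentum_step(X, M, -A, lr=0.1, beta=0.9)
print(np.linalg.norm(X.T @ X - np.eye(2)))  # orthogonality error near machine precision
```

The per-step re-projection of `M` is the overhead that a method with "intrinsically added momentum" dispenses with, while the QR retraction keeps the constraint satisfied exactly rather than approximately.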