The problem of optimization on the Stiefel manifold, i.e., minimizing functions of (not necessarily square) matrices that satisfy orthogonality constraints, has been extensively studied, partly due to its rich machine learning applications. Yet a new approach is proposed here, based on, for the first time, an interplay between thoughtfully designed continuous and discrete dynamics. It leads to a gradient-based optimizer with intrinsically added momentum. This method exactly preserves the manifold structure but does not require the commonly used projection or retraction, and thus has lower computational cost than existing algorithms. Its generalization to adaptive learning rates is also demonstrated. Strong performance is observed in various practical tasks. For instance, we discover that placing orthogonality constraints on the attention heads of a trained-from-scratch Vision Transformer [Dosovitskiy et al. 2022] can remarkably improve its performance when our optimizer is used, and that it is better to make each head orthogonal within itself but not necessarily orthogonal to the other heads. This optimizer also makes the useful notion of Projection Robust Wasserstein Distance [Paty & Cuturi 2019][Lin et al. 2020] for high-dimensional optimal transport even more effective.
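To make the setting concrete, the sketch below illustrates the conventional retraction-based baseline that the abstract contrasts against: Riemannian gradient descent on the Stiefel manifold St(n, p) = {X : X^T X = I}, where each step projects the Euclidean gradient onto the tangent space and then applies a QR-based retraction to return to the manifold. This is explicitly not the proposed optimizer (which avoids such per-step projections/retractions); it is a minimal NumPy sketch of the standard approach, and all function names here are illustrative.

```python
import numpy as np

def stiefel_tangent_projection(X, G):
    # Project a Euclidean gradient G onto the tangent space of
    # St(n, p) at X under the embedded metric:
    #   P_X(G) = G - X * sym(X^T G),  sym(A) = (A + A^T) / 2
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2

def qr_retraction(Y):
    # Map a full-rank n x p matrix back onto St(n, p) via the Q
    # factor of a thin QR decomposition (column signs fixed so the
    # map is well defined).
    Q, R = np.linalg.qr(Y)
    signs = np.sign(np.sign(np.diag(R)) + 0.5)  # maps 0 -> +1
    return Q * signs

def retraction_sgd_step(X, G, lr=1e-3):
    # One step of conventional retraction-based Riemannian gradient
    # descent -- the per-step retraction cost that the proposed
    # projection/retraction-free optimizer is designed to avoid.
    return qr_retraction(X - lr * stiefel_tangent_projection(X, G))

# Toy usage: minimize f(X) = -trace(X^T A X) over St(n, p); the
# minimizers span the top-p eigenspace of the symmetric matrix A.
rng = np.random.default_rng(0)
n, p = 50, 5
A = rng.standard_normal((n, n))
A = A + A.T
X = np.linalg.qr(rng.standard_normal((n, p)))[0]
for _ in range(500):
    G = -2 * A @ X  # Euclidean gradient of f (A symmetric)
    X = retraction_sgd_step(X, G)
print(np.linalg.norm(X.T @ X - np.eye(p)))  # ~1e-15: stays on the manifold
```

Every iterate remains feasible only because the retraction is reapplied at each step; the abstract's claim is that its optimizer preserves X^T X = I exactly through the dynamics themselves, removing this repeated cost while still carrying momentum.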