Memory-efficient optimization is critical for training increasingly large language models (LLMs). A popular strategy is gradient low-rank projection, which stores only the projected optimizer states, with GaLore being a representative example. However, a significant drawback of many such methods is their lack of convergence guarantees: the various low-rank projection schemes introduce inherent biases relative to the original optimization algorithms, which contribute to performance gaps compared to full-parameter training. To address this problem, this paper investigates layerwise sampling as a technique for debiasing low-rank projection mechanisms. In particular, one instantiation of this paradigm yields a novel, unbiased low-rank optimization method built on GaLore's projection mechanism and the Muon algorithm, named GaLore Unbiased with Muon (GUM). We theoretically prove that our method matches the convergence guarantees of the base Muon algorithm while preserving the memory efficiency of low-rank techniques. Experiments on LLM fine-tuning and pretraining also demonstrate non-trivial improvements over GaLore and even better performance than full-parameter training. Further analysis shows that the improvement stems from a more uniform distribution of knowledge within layers, which leads to more efficient utilization of the model's parameter space and better memorization.
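For context, the sketch below illustrates the two background mechanisms the abstract refers to: GaLore-style gradient low-rank projection (only a projected optimizer state is stored) and a Muon-style orthogonalized momentum update. This is a minimal, illustrative implementation under our own assumptions; the function names (`newton_schulz`, `lowrank_muon_step`) are hypothetical, and the layerwise-sampling debiasing that defines GUM itself is not shown here.

```python
import torch

# Quintic Newton-Schulz iteration, used as a Muon-style orthogonalization of a
# momentum matrix; coefficients follow the public Muon reference implementation.
def newton_schulz(M, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X


@torch.no_grad()
def lowrank_muon_step(weight, grad, proj, state, lr=1e-3, beta=0.9):
    """One illustrative GaLore-style low-rank step with a Muon-style update.

    `proj` is an (m, r) matrix with orthonormal columns, e.g. the top-r left
    singular vectors of a recent gradient. Only the projected (r, n) momentum
    is stored, which is where the memory saving of low-rank methods comes from.
    """
    r_grad = proj.T @ grad                           # (r, n) projected gradient
    buf = state.setdefault("momentum", torch.zeros_like(r_grad))
    buf.mul_(beta).add_(r_grad)                      # low-rank momentum buffer
    update = proj @ newton_schulz(buf)               # orthogonalize, project back
    weight.add_(update, alpha=-lr)


# Toy usage: projector from the top-r left singular vectors of a gradient.
m, n, r = 256, 256, 32
W, G = torch.randn(m, n), torch.randn(m, n)
U, _, _ = torch.linalg.svd(G, full_matrices=False)
lowrank_muon_step(W, G, U[:, :r], state={})
```

Because the projection discards the gradient components outside the span of `proj`, this plain low-rank step is biased relative to full-parameter Muon; the paper's contribution is a layerwise sampling correction that removes this bias in expectation, for which the paper itself should be consulted.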