Full-parameter fine-tuning is a key technique for adapting large language models (LLMs) to downstream tasks, but it incurs substantial memory overhead because backpropagation requires caching extensive intermediate activations. This bottleneck makes full fine-tuning of contemporary large-scale LLMs challenging in practice. Existing distributed training frameworks mitigate the problem with techniques such as DeepSpeed's ZeRO and PyTorch's FSDP, which shard training state across multiple GPUs or offload it to the CPU, but they typically require additional hardware and slow down training. We introduce RevFFN, a memory-efficient fine-tuning paradigm for mixture-of-experts (MoE) LLMs. RevFFN employs carefully designed reversible Transformer blocks that reconstruct each layer's input activations from its outputs during backpropagation, eliminating the need to store most intermediate activations. While preserving the expressive capacity of MoE architectures, this approach substantially reduces the peak memory consumption of full-parameter fine-tuning. As a result, RevFFN enables efficient full fine-tuning on a single consumer-grade or server-grade GPU.
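For intuition, the memory saving from reversibility can be illustrated with a RevNet-style additive coupling, in which a block splits its input into two streams and the backward pass recomputes the inputs from the outputs instead of caching them. The PyTorch sketch below is an illustrative assumption only: the class `ReversibleBlock` and the sub-modules `f` and `g` are hypothetical names, and this is not RevFFN's actual block design, which the abstract does not specify.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Additive-coupling reversible block (RevNet-style sketch, not RevFFN itself).

    Forward:  y1 = x1 + F(x2),  y2 = x2 + G(y1)
    Inverse:  x2 = y2 - G(y1),  x1 = y1 - F(x2)
    Because the coupling is exactly invertible, the block's input
    activations can be recomputed during the backward pass rather than stored.
    """

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f  # e.g. an attention sub-layer
        self.g = g  # e.g. a (MoE) feed-forward sub-layer

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Reconstruct the inputs from the outputs.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Quick check that reconstruction recovers the original inputs.
torch.manual_seed(0)
d = 16
block = ReversibleBlock(
    f=nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d)),
    g=nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d)),
)
x1, x2 = torch.randn(2, d), torch.randn(2, d)
y1, y2 = block(x1, x2)
rx1, rx2 = block.inverse(y1, y2)
print(torch.allclose(rx1, x1, atol=1e-5), torch.allclose(rx2, x2, atol=1e-5))
```

In a full training setup, such an inverse would be invoked inside a custom autograd function so that only the final layer outputs are kept in memory; the sketch above only demonstrates the invertibility that makes this possible.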