On-device fine-tuning is a critical capability for edge AI systems, which must adapt to diverse agentic tasks under stringent memory constraints. Conventional backpropagation (BP)-based training requires storing layer activations and optimizer states, a demand that activation checkpointing can only partially alleviate. In edge deployments where the model weights must reside entirely in device memory, this overhead severely limits the maximum model size that can be deployed. Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck by estimating gradients from forward evaluations alone, eliminating the need to store intermediate activations and optimizer states. This allows significantly larger models to fit within device memory, albeit at the cost of potentially longer fine-tuning wall-clock time. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO offers accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.
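To make the mechanism concrete, below is a minimal sketch of one MeZO-style SPSA step in PyTorch, under stated assumptions: the `loss_fn(model, batch)` interface, the hyperparameter values, and the helper names are hypothetical illustrations, not the paper's implementation. The key memory trick is that the Gaussian perturbation z is never stored; it is regenerated on the fly from a saved RNG seed, so the only persistent state beyond the weights is a scalar seed.

```python
import torch

def mezo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6):
    """One MeZO-style SPSA step: two forward passes, no backward pass.

    `loss_fn(model, batch)` is an assumed interface that runs a forward
    pass and returns a scalar loss tensor. The perturbation z is never
    materialized for the whole model at once; it is regenerated from a
    saved seed, so extra memory beyond the weights is negligible.
    """
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Re-create the identical Gaussian z for every parameter from `seed`
        # and apply an in-place update p += scale * z.
        gen = torch.Generator(device='cpu').manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen)
            p.data.add_(z.to(device=p.device, dtype=p.dtype), alpha=scale)

    with torch.no_grad():  # forward-only: no activation graph is kept
        perturb(+eps)                   # theta + eps * z
        loss_plus = loss_fn(model, batch).item()
        perturb(-2 * eps)               # theta - eps * z
        loss_minus = loss_fn(model, batch).item()
        perturb(+eps)                   # restore theta

        g = (loss_plus - loss_minus) / (2 * eps)  # projected gradient
        perturb(-lr * g)                # SGD update along z: theta -= lr*g*z

    return loss_plus
```

As a rough illustration of the resulting memory gap: this step keeps only the current weights (plus a scalar seed and two scalar losses) resident, so the training footprint is close to the inference footprint, whereas BP with an Adam-style optimizer additionally holds gradients and two moment buffers, roughly four times the weight memory before activations are even counted.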