On-device training enables the model to adapt to new data collected from the sensors by fine-tuning a pre-trained model. However, the training memory consumption is prohibitive for IoT devices that have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to mixed bit-precision and the lack of normalization; (2) the limited hardware resources (memory and computation) do not allow full backward computation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offloads the runtime auto-differentiation to compile time. Our framework is the first practical solution for on-device transfer learning of visual recognition on tiny IoT devices (e.g., a microcontroller with only 256KB of SRAM), using less than 1/100 of the memory of existing frameworks while matching the accuracy of cloud training + edge deployment on the tinyML application VWW (Visual Wake Words). Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for on-device lifelong learning.
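To make the optimization difficulty concrete: in a quantized graph, an int8 weight tensor is tied to a small scale factor, so the stored weight values are inflated while their gradients are shrunk relative to the full-precision counterparts, leaving the gradient-to-weight ratio off by roughly the square of the scale. Below is a minimal PyTorch sketch of the gradient-calibration idea behind Quantization-Aware Scaling; the tensor shapes, the scale value `s`, and the helper `qas_rescale` are illustrative assumptions, not the paper's implementation.

```python
import torch

# Sketch of the Quantization-Aware Scaling (QAS) idea (illustrative, not the
# paper's code).  A quantized weight tensor w_q is tied to a per-tensor scale s,
# so the real-valued weight is approximately w ≈ s * w_q.  By the chain rule the
# gradient w.r.t. w_q is s times the full-precision gradient, while w_q itself is
# 1/s times larger than w, so the gradient/weight ratio is off by a factor of s**2.
# Rescaling the gradient by 1/s**2 restores the full-precision update ratio,
# without introducing any extra hyper-parameter.

def qas_rescale(grad_wq: torch.Tensor, scale: float) -> torch.Tensor:
    """Rescale the gradient of a quantized weight tensor by 1/scale**2."""
    return grad_wq / (scale ** 2)

# Toy usage with a fake-quantized weight and an assumed per-tensor scale.
s = 0.02
w_q = (torch.randn(8, 8) / s).requires_grad_(True)   # weight values in the int-like domain

x = torch.randn(4, 8)
loss = (x @ (s * w_q)).pow(2).mean()                 # forward uses w ≈ s * w_q
loss.backward()

scaled_grad = qas_rescale(w_q.grad, s)               # calibrated gradient before the SGD step
```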
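Sparse Update reduces the backward-pass memory by computing gradients only for a pre-selected subset of layers (and, in the paper, sub-tensors such as biases or partial channels). The sketch below illustrates the layer-level version with ordinary PyTorch parameter freezing; the toy backbone, the chosen trainable layer, and the hyper-parameters are assumptions for illustration and do not reflect the paper's sub-tensor selection.

```python
import torch
from torch import nn

# Illustrative sparse update: gradients are only computed for a small,
# pre-selected set of parameters (here, just the classifier head), while the
# rest of the backbone is frozen.  Because the frozen layers never require
# gradients, autograd skips their backward computation entirely, which is what
# keeps the training memory footprint small.

model = nn.Sequential(                       # stand-in for a pre-trained backbone
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),                        # e.g. a 2-class VWW-style head
)

trainable = {"6"}                            # layer indices chosen for update (assumed)
for name, param in model.named_parameters():
    param.requires_grad = name.split(".")[0] in trainable

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01
)

x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                              # frozen sub-graphs contribute no backward work
optimizer.step()
```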