This paper proposes Mandheling, the first system that enables highly resource-efficient on-device training by orchestrating mixed-precision training with on-chip Digital Signal Processor (DSP) offloading. Mandheling fully exploits the advantages of the DSP in integer-based numerical calculation through four novel techniques: (1) a CPU-DSP co-scheduling scheme to mitigate the overhead of DSP-unfriendly operators; (2) a self-adaptive rescaling algorithm to reduce the overhead of dynamic rescaling in backward propagation; (3) a batch-splitting algorithm to improve DSP cache efficiency; (4) a DSP-compute subgraph-reuse mechanism to eliminate the preparation overhead on the DSP. We have fully implemented Mandheling and demonstrate its effectiveness through extensive experiments. The results show that, compared to the state-of-the-art DNN engines TFLite and MNN, Mandheling reduces per-batch training time by 5.5$\times$ and energy consumption by 8.9$\times$ on average. In end-to-end training tasks, Mandheling reduces convergence time by up to 10.7$\times$ and energy consumption by up to 13.1$\times$, with only a 1.9%-2.7% accuracy loss compared to the FP32 precision setting.
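To make the rescaling overhead in technique (2) concrete: in integer training, each tensor carries a scale factor that must track the shifting range of gradients, and recomputing it with a full max-abs scan on every backward step is costly. The sketch below (Python/NumPy; the `AdaptiveScale` class, the `saturation_tol` threshold, and the saturation-triggered update rule are illustrative assumptions, not Mandheling's published algorithm) shows one way a self-adaptive scheme can amortize that scan:

```python
import numpy as np

INT8_MAX = 127

class AdaptiveScale:
    """Per-tensor scale refreshed only when saturation is observed,
    instead of via a max-abs scan on every backward step.
    (Illustrative sketch; not Mandheling's actual algorithm.)"""

    def __init__(self, init_scale=1.0, saturation_tol=0.01):
        self.scale = init_scale               # real_value ~= q * scale
        self.saturation_tol = saturation_tol  # assumed trigger threshold

    def quantize(self, grad_fp32):
        q = np.clip(np.round(grad_fp32 / self.scale), -INT8_MAX, INT8_MAX)
        # Cheap fast-path check: fraction of values pinned at the int8 limit.
        if np.mean(np.abs(q) == INT8_MAX) > self.saturation_tol:
            # Slow path: full max-abs reduction to refresh the stale scale.
            self.scale = max(np.abs(grad_fp32).max(), 1e-8) / INT8_MAX
            q = np.clip(np.round(grad_fp32 / self.scale), -INT8_MAX, INT8_MAX)
        return q.astype(np.int8)

# Usage: the expensive reduction runs at most once here, then the
# cached scale is reused on subsequent gradients of similar range.
s = AdaptiveScale()
g = np.random.randn(1024).astype(np.float32)
q = s.quantize(g)
```

The point of the adaptation is that the full-tensor reduction, which dominates rescaling cost when performed every step, runs only when clipping indicates the cached scale has gone stale; this is the flavor of savings a self-adaptive rescaling scheme targets.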