In the last few years, research and development on Deep Learning models and techniques for ultra-low-power devices (in a word, TinyML) has mainly focused on a train-then-deploy assumption, with static models that cannot be adapted to newly collected data without cloud-based data collection and fine-tuning. Latent Replay-based Continual Learning (CL) techniques [1] enable online, serverless adaptation in principle, but so far they have been too computation- and memory-hungry for ultra-low-power TinyML devices, which are typically based on microcontrollers. In this work, we introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power (PULP) processor. We rethink the baseline Latent Replay CL algorithm, leveraging quantization of the frozen stage of the model and of the Latent Replays (LRs) to reduce their memory cost with minimal impact on accuracy. In particular, 8-bit compression of the LR memory proves to be almost lossless (-0.26% with 3000 LRs) compared to the full-precision baseline implementation while requiring 4x less memory; 7-bit compression can also be used, with an additional minimal accuracy degradation (up to 5%). We also introduce optimized primitives for forward and backward propagation on the PULP processor. Our results show that, by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory, an amount compatible with embedding in TinyML devices. On an advanced 22nm prototype of our platform, called VEGA, the proposed solution performs on average 65x faster than a low-power STM32 L4 microcontroller while being 37x more energy-efficient, which is enough for a lifetime of 535h when learning a new mini-batch of data once every minute.
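The 4x memory reduction quoted for 8-bit Latent Replays follows from storing one byte per activation instead of a 4-byte FP32 value. The following is a minimal sketch of that idea, not the quantization scheme used in this work: it assumes non-negative (for example, post-ReLU) latent activations and a single per-buffer scale, and the names `quantize_latents`, `dequantize_latents`, and `fp32_size_bytes` are hypothetical.

```python
import struct


def quantize_latents(values, bits=8):
    """Uniformly quantize floats to `bits`-bit unsigned codes, one byte each.

    Assumption: values are non-negative; negative inputs are clamped to 0.
    """
    if not 1 <= bits <= 8:
        raise ValueError("bits must be in [1, 8] to fit one byte per code")
    levels = (1 << bits) - 1          # highest code, e.g. 255 for 8 bits
    max_val = max(values)
    scale = max_val / levels if max_val > 0 else 1.0
    codes = bytes(min(levels, max(0, round(v / scale))) for v in values)
    return codes, scale


def dequantize_latents(codes, scale):
    """Map the integer codes back to approximate float activations."""
    return [c * scale for c in codes]


def fp32_size_bytes(n_values):
    """Bytes needed to store n_values as IEEE-754 single precision."""
    return n_values * struct.calcsize("<f")


# Toy latent vector (illustrative values only, not data from the paper).
latents = [0.0, 0.5, 1.0, 2.55]
codes, scale = quantize_latents(latents, bits=8)
restored = dequantize_latents(codes, scale)
ratio = fp32_size_bytes(len(latents)) // len(codes)
print("compression ratio:", ratio)
print("max abs error:", max(abs(a - b) for a, b in zip(latents, restored)))
```

In this sketch, each code occupies a full byte, so calling it with `bits=7` lowers the precision without saving further memory; realising a smaller footprint at sub-byte widths would require bit-packing, which is not shown here.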