Executing machine learning inference tasks on resource-constrained edge devices requires careful hardware-software co-design. Recent work has shown how transformer-based deep neural network models such as ALBERT can enable natural language processing (NLP) inference on mobile systems-on-chip housing custom hardware accelerators. However, while these existing solutions effectively alleviate the latency, energy, and area costs of running a single NLP task, multi-task inference requires running computations over multiple variants of the model parameters, each tailored to one of the target tasks. This approach leads either to prohibitive on-chip memory requirements or to costly off-chip memory accesses. This paper proposes adapter-ALBERT, an efficient model optimization that maximizes data reuse across different tasks. We evaluate the proposed model's performance and robustness to data compression methods across several language tasks from the GLUE benchmark. Additionally, we demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator, extrapolating the performance, power, and area improvements over executing a traditional ALBERT model on the same hardware platform.
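To make the data-reuse idea concrete, the following is a minimal PyTorch sketch of the adapter pattern the abstract describes: a single frozen, shared backbone serves every task, and only a small task-specific bottleneck module differs per task. This assumes a standard bottleneck-adapter design; the Adapter class, the bottleneck width of 64, the stand-in nn.Linear backbone layer, and the task names are illustrative assumptions, not the paper's actual adapter configuration or ALBERT integration.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Small task-specific bottleneck inserted after a frozen shared layer."""
        def __init__(self, hidden_dim, bottleneck_dim=64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
            self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
            self.act = nn.GELU()

        def forward(self, x):
            # Residual connection preserves the frozen backbone's representation.
            return x + self.up(self.act(self.down(x)))

    class MultiTaskEncoderLayer(nn.Module):
        """One shared (frozen) layer plus one lightweight adapter per task.

        Only the adapters differ across tasks, so the large shared weights
        can stay resident on-chip and be reused for every task.
        """
        def __init__(self, hidden_dim, tasks):
            super().__init__()
            # Stand-in for a frozen ALBERT encoder layer (illustrative only).
            self.shared = nn.Linear(hidden_dim, hidden_dim)
            for p in self.shared.parameters():
                p.requires_grad = False  # backbone is frozen and shared
            self.adapters = nn.ModuleDict({t: Adapter(hidden_dim) for t in tasks})

        def forward(self, x, task):
            return self.adapters[task](self.shared(x))

    layer = MultiTaskEncoderLayer(hidden_dim=768, tasks=["mnli", "sst2"])
    x = torch.randn(1, 128, 768)  # (batch, sequence, hidden)
    out = layer(x, task="sst2")
    print(out.shape)              # torch.Size([1, 128, 768])

Because the shared weights dominate the parameter count while each adapter contributes only a small fraction of it, the shared weights can remain pinned in on-chip memory and be reused across tasks, while the tiny per-task adapters are swapped in cheaply; this is the property that motivates the heterogeneous on-chip memory mapping evaluated in the paper.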