Executing machine learning inference tasks on resource-constrained edge devices requires careful hardware-software co-design optimizations. Recent examples have shown how transformer-based deep neural network models such as ALBERT can enable natural language processing (NLP) inference on mobile systems-on-chip equipped with custom hardware accelerators. However, while these existing solutions are effective in alleviating the latency, energy, and area costs of running single NLP tasks, achieving multi-task inference requires running computations over multiple variants of the model parameters, each tailored to one of the targeted tasks. This approach leads to either prohibitive on-chip memory requirements or costly off-chip memory accesses. This paper proposes adapter-ALBERT, an efficient model optimization for maximal data reuse across different tasks. The proposed model's performance and robustness to data compression methods are evaluated across several language tasks from the GLUE benchmark. Additionally, we demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture through simulations on a validated NLP edge accelerator, extrapolating performance, power, and area improvements over the execution of a traditional ALBERT model on the same hardware platform.
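To illustrate the adapter idea behind adapter-ALBERT, the following is a minimal PyTorch-style sketch, not the paper's implementation: a large backbone is frozen and reused across tasks, while each task only contributes a small bottleneck adapter (the names BottleneckAdapter, hidden_size, and bottleneck, and the use of a generic encoder layer in place of ALBERT, are illustrative assumptions).

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small per-task adapter: down-project, non-linearity, up-project,
    added residually to the frozen backbone's hidden states."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# The shared backbone weights stay fixed and can reside in on-chip memory;
# switching tasks only swaps the small per-task adapter parameters instead
# of a full copy of the model.
shared_backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in shared_backbone.parameters():
    p.requires_grad = False  # reused, unchanged, across all tasks

task_adapters = nn.ModuleDict({
    "sst2": BottleneckAdapter(),  # hypothetical per-task adapters
    "mnli": BottleneckAdapter(),
})

x = torch.randn(1, 16, 768)                      # (batch, sequence, hidden)
out = task_adapters["sst2"](shared_backbone(x))  # shared compute + task-specific adapter
print(out.shape)                                 # torch.Size([1, 16, 768])
```

Under these assumptions, each adapter adds roughly 2 * hidden_size * bottleneck parameters per layer, orders of magnitude less than a separate fine-tuned copy of the model per task, which is what makes multi-task reuse of on-chip memory plausible.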