EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul N. Whatmough, Alexander M. Rush, David Brooks, Gu-Yeon Wei

From arXiv; 12 pages plus references. Paper to appear at the 54th IEEE/ACM International Symposium on Microarchitecture (MICRO 2021).

Transformer-based language models such as BERT provide significant accuracy improvements for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy on resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), and high-density embedded non-volatile memories (eNVMs) wherein the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system consumes up to 7x, 2.5x, and 53x less energy than the conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
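As a rough illustration of the two ideas the abstract combines, the sketch below shows (a) an entropy-based early exit test applied to per-layer classifier outputs and (b) choosing the slowest clock frequency that still meets a target latency for the number of layers expected to run. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the per-layer logits, the cycle count, and the frequency list are all hypothetical, and the sketch simply takes the exit layer as given rather than modeling how EdgeBERT predicts it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_layer(per_layer_logits, threshold):
    """Return (1-based exit layer, probabilities).

    Exits at the first layer whose output entropy falls below `threshold`;
    if no layer is confident enough, the last layer's output is used.
    """
    if not per_layer_logits:
        raise ValueError("need logits for at least one layer")
    for layer, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        if entropy(probs) < threshold:
            return layer, probs
    return len(per_layer_logits), probs


def lowest_frequency_meeting_target(layers_to_run, cycles_per_layer,
                                    target_latency_s, freqs_hz):
    """Pick the slowest available clock that finishes within the target.

    Assumes (hypothetically) a fixed cycle cost per layer. Falls back to
    the fastest clock when no frequency can meet the target.
    """
    required_cycles = layers_to_run * cycles_per_layer
    for f in sorted(freqs_hz):
        if required_cycles / f <= target_latency_s:
            return f
    return max(freqs_hz)


# Illustrative values only (not taken from the paper).
logits = [[0.1, 0.0], [0.5, 0.0], [4.0, 0.0], [6.0, 0.0]]
exit_layer, probs = early_exit_layer(logits, threshold=0.3)
freq = lowest_frequency_meeting_target(
    exit_layer, cycles_per_layer=1_000_000,
    target_latency_s=0.05, freqs_hz=[50e6, 100e6, 200e6])
```

The design point this is meant to convey: a sentence that can exit early leaves latency slack, and that slack can be spent by running at a lower frequency (and correspondingly lower voltage) instead of finishing ahead of the deadline.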
