Transformer-based language models such as BERT provide significant accuracy improvements on a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth and principled algorithm and hardware design methodology for achieving minimal latency and energy consumption on multi-task NLP inference. Compared to the ALBERT baseline, we achieve up to 2.4x and 13.4x inference latency and memory savings, respectively, with less than a 1%-pt drop in accuracy on several GLUE benchmarks, by employing a calibrated combination of 1) entropy-based early stopping, 2) adaptive attention span, 3) movement and magnitude pruning, and 4) floating-point quantization. Furthermore, to maximize the benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a scalable hardware architecture wherein floating-point bit encodings of the shareable multi-task embedding parameters are stored in high-density non-volatile memory. Altogether, EdgeBERT enables fully on-chip inference acceleration of NLP workloads with 5.2x and 157x lower energy than an unoptimized accelerator and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
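For concreteness, below is a minimal sketch of the entropy-based early-stopping criterion named above: each transformer layer feeds a lightweight classifier, and inference halts as soon as the softmax entropy of the current layer's prediction falls below a calibrated threshold. The function name, threshold value, and per-layer-logits interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def entropy_early_exit(logits_per_layer, threshold=0.4):
    """Sketch of entropy-based early stopping (assumed interface).

    logits_per_layer: iterable of classifier logits, one per transformer layer.
    threshold: calibrated entropy cutoff (illustrative value).
    Returns (exit_depth, predicted_label).
    """
    depth, probs = 0, None
    for depth, logits in enumerate(logits_per_layer, start=1):
        # Numerically stable softmax over this layer's prediction
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Shannon entropy of the prediction: low entropy = high confidence
        entropy = -np.sum(probs * np.log(probs + 1e-12))
        if entropy < threshold:
            break  # confident enough: skip the remaining layers
    return depth, int(probs.argmax())

# Hypothetical per-layer logits for a 3-class task: the second layer is
# already confident, so inference exits at depth 2 rather than running all layers.
layers = [np.array([0.2, 0.1, 0.3]),
          np.array([4.0, -1.0, -0.8]),
          np.array([5.0, -2.0, -1.0])]
print(entropy_early_exit(layers))  # -> (2, 0)
```

The latency saving comes directly from the exit depth: easy inputs terminate after a few layers, while hard inputs fall through to the full network.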