Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning - 专知论文

October 30, 2021

Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning

Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi

From arXiv, 8 pages

Pre-training and then fine-tuning large language models is a common way to achieve state-of-the-art performance on natural language processing (NLP) tasks. However, most pre-trained models suffer from low inference speed, which makes deploying them in applications with latency constraints challenging. In this work, we focus on accelerating inference via conditional computation. To this end, we propose Magic Pyramid (MP), which reduces both width-wise and depth-wise computation via token pruning and early exiting for Transformer-based models, particularly BERT. Token pruning saves computation by removing non-salient tokens, while early exiting reduces computation by terminating inference before the final layer once the exiting condition is met. Our empirical studies demonstrate that, compared to the previous state of the art, MP not only achieves speed-adjustable inference but also surpasses both token pruning and early exiting, reducing giga floating-point operations (GFLOPs) by up to 70% with less than a 0.5% drop in accuracy. Token pruning and early exiting exhibit distinct preferences for sequences of different lengths. MP, however, achieves an average 8.06x speedup on two popular text classification tasks regardless of input size.
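
The two ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how token saliency is scored or what the exiting condition is. The sketch assumes a fixed per-layer keep ratio for token pruning and an entropy threshold on a per-layer classifier for early exiting (the kind of criterion used by DeeBERT). All function names, the `scores` values, and the `layer_logits` inputs are hypothetical stand-ins for what a real model would compute.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats); low entropy means a confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prune_tokens(tokens, scores, keep_ratio):
    """Width-wise saving: keep the top-scoring fraction of tokens, in original order.

    `scores` is an assumed per-token saliency value; how it is obtained
    is outside the scope of this sketch.
    """
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)
    return [tokens[i] for i in kept], [scores[i] for i in kept]


def run_with_conditional_computation(tokens, scores, layer_logits,
                                     keep_ratio, entropy_threshold):
    """Depth-wise saving: stop at the first layer whose classifier is confident.

    `layer_logits[i]` stands in for the output of a hypothetical classifier
    attached to layer i + 1. Returns (predicted_class, layers_used, tokens_left).
    """
    for depth, logits in enumerate(layer_logits, start=1):
        # Drop non-salient tokens before "running" this layer.
        tokens, scores = prune_tokens(tokens, scores, keep_ratio)
        probs = softmax(logits)
        is_last = depth == len(layer_logits)
        # Exit early if confident enough, or fall through at the final layer.
        if entropy(probs) < entropy_threshold or is_last:
            pred = max(range(len(probs)), key=lambda i: probs[i])
            return pred, depth, len(tokens)


tokens = ["[CLS]", "the", "movie", "was", "great", "!", "[SEP]", "[PAD]"]
scores = [0.9, 0.1, 0.6, 0.2, 0.8, 0.3, 0.5, 0.0]
layer_logits = [[0.1, 0.2], [0.0, 3.0], [0.0, 5.0]]
result = run_with_conditional_computation(tokens, scores, layer_logits,
                                          keep_ratio=0.5, entropy_threshold=0.3)
print(result)  # (1, 2, 2): class 1, exited after layer 2 with 2 tokens left
```

In this toy run the sequence shrinks from 8 to 4 to 2 tokens while the model stops after the second of three layers, which is the "pyramid" shape the name suggests: fewer tokens at each layer and fewer layers overall. Raising `entropy_threshold` or lowering `keep_ratio` trades accuracy for speed, which is one way to read the abstract's "speed-adjustable inference".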
