变换头板的可区别的子节制 (Differentiable Subset Pruning of Transformer Heads) - 专知论文

会员服务 ·

0

剪枝 · Performer · 注意力机制 · 随机梯度下降 · Transformer ·

2021 年 8 月 22 日

Differentiable Subset Pruning of Transformer Heads

翻译：变换头板的可区别的子节制

Jiaoda Li,Ryan Cotterell,Mrinmaya Sachan

from arxiv, TACL 2021

Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.

翻译：多头关注是独立处理输入的不同部分的数种关注机制的集合,是变形器的关键成分。然而,最近的工作表明,变形器多头关注机制中的一大部分头可以安全地切除,而不会严重损害模型的性能;这种裁剪导致实际操作中明显较小和更快的模型。我们的工作引入了一种新的头裁技术,我们称之为可区分子裁剪。直觉上,我们的方法学习了每个头重要性变量,然后对未划线的头数施加了用户指定的硬性限制。重要变量通过随机梯度梯度下降来学习。我们在自然语言推断和机器翻译方面进行实验;我们显示,可区分的子裁剪的功能比以前的工作更具有可比性或更好性,同时提供准确的对宽度水平的控制。

0

相关内容

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

51+阅读 · 2020年12月14日

【干货书】机器学习速查手册，135页pdf

【干货书】机器学习速查手册，135页pdf

专知会员服务

127+阅读 · 2020年11月20日

【Google】深度学习对抗鲁棒性，43页ppt

专知会员服务

45+阅读 · 2020年10月31日

最新《非光滑优化》十讲硬核课程，剑桥大学梁经纬博士主讲

最新《非光滑优化》十讲硬核课程，剑桥大学梁经纬博士主讲

专知会员服务

33+阅读 · 2020年8月14日

【Google】平滑对抗训练，Smooth Adversarial Training

【Google】平滑对抗训练，Smooth Adversarial Training

专知会员服务

49+阅读 · 2020年7月4日

知识图谱推理，50页ppt，Salesforce首席科学家Richard Socher

知识图谱推理，50页ppt，Salesforce首席科学家Richard Socher

专知会员服务

111+阅读 · 2020年6月10日

Fariz Darari简明《博弈论Game Theory》介绍，35页ppt

Fariz Darari简明《博弈论Game Theory》介绍，35页ppt

专知会员服务

112+阅读 · 2020年5月15日

深度强化学习策略梯度教程，53页ppt

深度强化学习策略梯度教程，53页ppt

专知会员服务

184+阅读 · 2020年2月1日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

已删除

将门创投

3+阅读 · 2019年11月25日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

最佳实践：深度学习用于自然语言处理（三）

最佳实践：深度学习用于自然语言处理（三）

待字闺中

3+阅读 · 2017年8月20日

FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes

Arxiv

0+阅读 · 2021年10月15日

PTQ-SL: Exploring the Sub-layerwise Post-training Quantization

Arxiv

0+阅读 · 2021年10月15日

Efficient and Private Federated Learning with Partially Trainable Networks

Arxiv

0+阅读 · 2021年10月6日

Learning Hard Retrieval Decoder Attention for Transformers

Arxiv

0+阅读 · 2021年9月10日

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Arxiv

21+阅读 · 2020年12月17日

Pretrained Transformers for Text Ranking: BERT and Beyond

Arxiv

28+阅读 · 2020年10月13日

On Layer Normalization in the Transformer Architecture

Arxiv

4+阅读 · 2020年2月12日

The Evolved Transformer

The Evolved Transformer

Arxiv

5+阅读 · 2019年1月30日

On The Alignment Problem In Multi-Head Attention-Based Neural Machine Translation

On The Alignment Problem In Multi-Head Attention-Based Neural Machine Translation

Arxiv

3+阅读 · 2018年9月11日

DARTS: Differentiable Architecture Search

Arxiv

3+阅读 · 2018年6月24日

VIP会员

文章信息

相关主题

注意力机制

随机梯度下降

相关VIP内容

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

51+阅读 · 2020年12月14日

【干货书】机器学习速查手册，135页pdf

【干货书】机器学习速查手册，135页pdf

专知会员服务

127+阅读 · 2020年11月20日

【Google】深度学习对抗鲁棒性，43页ppt

专知会员服务

45+阅读 · 2020年10月31日

最新《非光滑优化》十讲硬核课程，剑桥大学梁经纬博士主讲

最新《非光滑优化》十讲硬核课程，剑桥大学梁经纬博士主讲

专知会员服务

33+阅读 · 2020年8月14日

【Google】平滑对抗训练，Smooth Adversarial Training

【Google】平滑对抗训练，Smooth Adversarial Training

专知会员服务

49+阅读 · 2020年7月4日

知识图谱推理，50页ppt，Salesforce首席科学家Richard Socher

知识图谱推理，50页ppt，Salesforce首席科学家Richard Socher

专知会员服务

111+阅读 · 2020年6月10日

Fariz Darari简明《博弈论Game Theory》介绍，35页ppt

Fariz Darari简明《博弈论Game Theory》介绍，35页ppt

专知会员服务

112+阅读 · 2020年5月15日

深度强化学习策略梯度教程，53页ppt

深度强化学习策略梯度教程，53页ppt

专知会员服务

184+阅读 · 2020年2月1日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

热门VIP内容

开通专知VIP会员享更多权益服务

《多域空战指挥体系：驾驭复杂性的艺术》

构建军事人工智能信任体系始于破除黑盒机制

《生态建模密码破译：建模与编程实践》美陆军最新报告

《战争形态演变：合成兵种防御主导模式探析》48页slides

相关资讯

已删除

将门创投

3+阅读 · 2019年11月25日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

最佳实践：深度学习用于自然语言处理（三）

最佳实践：深度学习用于自然语言处理（三）

待字闺中

3+阅读 · 2017年8月20日

相关论文

FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes

Arxiv

0+阅读 · 2021年10月15日

PTQ-SL: Exploring the Sub-layerwise Post-training Quantization

Arxiv

0+阅读 · 2021年10月15日

Efficient and Private Federated Learning with Partially Trainable Networks

Arxiv

0+阅读 · 2021年10月6日

Learning Hard Retrieval Decoder Attention for Transformers

Arxiv

0+阅读 · 2021年9月10日

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Arxiv

21+阅读 · 2020年12月17日

Pretrained Transformers for Text Ranking: BERT and Beyond

Arxiv

28+阅读 · 2020年10月13日

On Layer Normalization in the Transformer Architecture

Arxiv

4+阅读 · 2020年2月12日

The Evolved Transformer

The Evolved Transformer

Arxiv

5+阅读 · 2019年1月30日

On The Alignment Problem In Multi-Head Attention-Based Neural Machine Translation

On The Alignment Problem In Multi-Head Attention-Based Neural Machine Translation

Arxiv

3+阅读 · 2018年9月11日

DARTS: Differentiable Architecture Search

Arxiv

3+阅读 · 2018年6月24日

微信扫码咨询专知VIP会员