Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of learning only from the teacher's soft labels as in conventional KD, researchers have found that the rich information contained in the hidden layers of BERT is conducive to the student's performance. To better exploit this hidden knowledge, a common practice is to force the student to deeply mimic the teacher's hidden states of all tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher's hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analyses. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length, and width. We first investigate a variety of strategies to extract the crucial knowledge along each single dimension, and then jointly compress the three dimensions. In this way, we show that 1) the student's performance can be improved by extracting and distilling the crucial HSK, and 2) using only a tiny fraction of HSK can achieve the same performance as extensive HSK distillation. Based on the second finding, we further propose an efficient KD paradigm for compressing BERT that does not require loading the teacher during student training. For two kinds of student models and computing devices, the proposed KD paradigm yields a training speedup of 2.7x to 3.4x.
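To make the layer-wise mimicking concrete, the sketch below shows a minimal hidden-state distillation loss in PyTorch with HuggingFace Transformers. It is only an illustration under assumed choices (bert-base-uncased as the teacher, a 4-layer BERT miniature as the student, a hand-picked student-to-teacher layer mapping, and the helper name hsk_loss), not the paper's exact implementation.

```python
# Minimal sketch of layer-wise hidden state distillation (not the authors' code).
# Assumptions: teacher = bert-base-uncased, student = a 4-layer/512-dim BERT
# miniature, and a fixed student-to-teacher layer mapping.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

teacher = AutoModel.from_pretrained(
    "bert-base-uncased", output_hidden_states=True
).eval()
student = AutoModel.from_pretrained(
    "google/bert_uncased_L-4_H-512_A-8", output_hidden_states=True
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Project the student's hidden size (512) to the teacher's (768) so the
# hidden states can be compared with an MSE loss.
proj = torch.nn.Linear(student.config.hidden_size, teacher.config.hidden_size)

def hsk_loss(batch_texts, layer_map=((1, 3), (2, 6), (3, 9), (4, 12))):
    """MSE between selected student and teacher hidden layers.

    The layer_map corresponds to the "depth" dimension of HSK; restricting the
    token positions or hidden units compared would correspond to the "length"
    and "width" dimensions discussed in the paper.
    """
    enc = tokenizer(batch_texts, return_tensors="pt",
                    padding=True, truncation=True)
    with torch.no_grad():                       # the teacher is frozen
        t_hidden = teacher(**enc).hidden_states
    s_hidden = student(**enc).hidden_states
    loss = 0.0
    for s_idx, t_idx in layer_map:              # which layers to distill
        loss = loss + F.mse_loss(proj(s_hidden[s_idx]), t_hidden[t_idx])
    return loss / len(layer_map)
```

In the efficient paradigm suggested by the second finding, the small amount of crucial teacher HSK could presumably be precomputed and cached offline, so that only the student needs to be loaded at training time; avoiding the teacher forward pass during training is what enables the reported 2.7x to 3.4x speedup.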