© Author | 王晓磊
Affiliation | First-year PhD student, Gaoling School of Artificial Intelligence, Renmin University of China
Advisor | Prof. 赵鑫
Research interests | Dialogue systems and pre-trained models
In recent years, large-scale pre-trained language models (PLMs), represented by the BERT and GPT families, have achieved great success across NLP. This post collects PLM-related papers published since BERT and GPT appeared: representative works selected by citation count, plus 2021 papers from major venues (ACL, EMNLP, ICLR, ICML, NeurIPS, etc.), 285 papers in total. They are organized into 6 categories and 22 subcategories: surveys, benchmark datasets, design of PLMs, analysis of PLMs, efficient PLMs, and use of PLMs. The post is also published on the AI Box Zhihu column (search for "AI Box" on Zhihu); comments and discussion under the column article are welcome!
The paper list compiled in this post is also maintained on GitHub, where papers from top venues will continue to be added; feel free to follow and star the repository.
https://github.com/RUCAIBox/PLMPapers
The list is organized into the following 6 categories and 22 subcategories: surveys, benchmark datasets, design of PLMs, analysis of PLMs, efficient PLMs, and use of PLMs:
· 1 Surveys ·
· 2 Benchmark Datasets ·
· 3 Design of PLMs ·
General Design
Knowledge Enhancement
Multilingual
Multimodal
Information Retrieval
Code
Others
· 4 Analysis of PLMs ·
Knowledge
Robustness
Sparsity
Others
· 5 Efficient PLMs ·
Model Training
Model Inference
Model Compression
· 6 Use of PLMs ·
Two-Stage Fine-tuning
Multi-Task Fine-tuning
Adapter
Prompt
Others
01
Surveys
"Pre-trained models for natural language processing: A survey". Science China Technological Sciences(2020)
[PDF]
"Which *BERT? A Survey Organizing Contextualized Encoders". EMNLP(2020)
[PDF]
"A Primer in BERTology: What We Know About How BERT Works". TACL(2020)
[PDF]
"From static to dynamic word representations: a survey". International Journal of Machine Learning and Cybernetics(2020)
[PDF]
"Overview of the Transformer-based Models for NLP Tasks". 2020 15th Conference on Computer Science and Information Systems (FedCSIS)
[PDF]
"A Survey on Contextual Embeddings". arXiv(2020)
[PDF]
"The NLP Cookbook: Modern Recipes for Transformer Based Deep Learning Architectures". IEEE Access(2021)
[PDF]
"Pre-Trained Models: Past, Present and Future". arXiv(2021)
[PDF]
"Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing". arXiv(2021)
[PDF]
"AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing". arXiv(2021)
[PDF]
"On the Opportunities and Risks of Foundation Models". arXiv(2021)
[PDF]
"Paradigm Shift in Natural Language Processing". arXiv(2021)
[PDF]
"Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey". arXiv(2021)
[PDF]
02
Benchmark Datasets
XNLI: "XNLI: Evaluating Cross-lingual Sentence Representations". EMNLP(2018)
[PDF] [Dataset]
GLUE: "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". ICLR(2019)
[Homepage]
SuperGLUE: "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems". NeurIPS(2019)
[Homepage]
CLUE: "CLUE: A Chinese Language Understanding Evaluation Benchmark". COLING(2020)
[Homepage]
XTREME: "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization". ICML(2020)
[Homepage]
XGLUE: "XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation". EMNLP(2020)
[Homepage]
DialoGLUE: "DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue". arXiv(2020)
[Homepage]
03
Design of PLMs
3.1 General Design
GPT: "Improving Language Understanding by Generative Pre-Training". OpenAI(2018)
[Project]
GPT-2: "Language Models are Unsupervised Multitask Learners". OpenAI(2019)
[Project]
BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL(2019)
[PDF] [Code]
XLNet: "XLNet: Generalized Autoregressive Pretraining for Language Understanding". NeurIPS(2019)
[PDF] [Code]
SBERT: "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". EMNLP(2019)
[PDF] [Code]
UniLM: "Unified Language Model Pre-training for Natural Language Understanding and Generation". NeurIPS(2019)
[PDF] [Code]
MASS: "MASS: Masked Sequence to Sequence Pre-training for Language Generation". ICML(2019)
[PDF] [Code]
Chinese-BERT-wwm: "Pre-Training with Whole Word Masking for Chinese BERT". arXiv(2019)
[PDF] [Code]
"Cloze-driven Pretraining of Self-attention Networks". EMNLP(2019)
[PDF]
"BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model". Workshop on Methods for Optimizing and Evaluating Neural Language Generation(2019)
[PDF] [Code]
GPT-3: "Language Models are Few-Shot Learners". NeurIPS(2020)
[PDF] [Code]
T5: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". JMLR(2020)
[PDF] [Code]
BART: "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension". ACL(2020)
[PDF] [Code]
Poly-encoders: "Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring". ICLR(2020)
[PDF]
SpanBERT: "SpanBERT: Improving Pre-training by Representing and Predicting Spans". TACL(2020)
[PDF] [Code]
ERNIE 2.0: "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding". AAAI(2020)
[PDF] [Code]
SemBERT: "Semantics-Aware BERT for Language Understanding". AAAI(2020)
[PDF] [Code]
"Leveraging Pre-trained Checkpoints for Sequence Generation Tasks". TACL(2020)
[PDF] [Code]
ProphetNet: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training". EMNLP(2020)
[PDF]
UniLMv2: "UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training". ICML(2020)
[PDF] [Code]
MacBERT: "Revisiting Pre-Trained Models for Chinese Natural Language Processing". EMNLP(2020)
[PDF] [Code]
MPNet: "MPNet: Masked and Permuted Pre-training for Language Understanding". arXiv(2020)
[PDF] [Code]
DEBERTA: "DeBERTa: Decoding-enhanced BERT with Disentangled Attention". ICLR(2021)
[PDF] [Code]
PALM: "PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation". EMNLP(2020)
[PDF]
Optimus: "Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space". EMNLP(2020)
[PDF] [Code]
"Self-training Improves Pre-training for Natural Language Understanding". NAACL(2021)
[PDF] [Code]
CAPT: "Rethinking Denoised Auto-Encoding in Language Pre-Training". EMNLP(2021)
[PDF]
"Frustratingly Simple Pretraining Alternatives to Masked Language Modeling". EMNLP(2021)
[PDF] [Code]
"Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models". ACL(2021)
[PDF] [Code]
ERNIE-Doc: "ERNIE-Doc: A Retrospective Long-Document Modeling Transformer". ACL(2021)
[PDF] [Code]
"Pre-training Universal Language Representation". ACL(2021)
[PDF] [Code]
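Most of the models in this subsection are trained with a masked-language-modeling (MLM) objective or a variant of it. As a rough illustration (not taken from any specific paper above), the sketch below shows BERT-style token masking in PyTorch; the 15% rate, the 80/10/10 replacement split, and the lack of special-token handling are simplifying assumptions.

```python
import torch

def mask_for_mlm(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: pick ~15% of positions; of those, 80% become [MASK],
    10% become a random token, 10% stay unchanged (special tokens not protected here)."""
    labels = input_ids.clone()
    picked = torch.rand(input_ids.shape) < mlm_prob
    labels[~picked] = -100                # loss is computed only on picked positions
    inputs = input_ids.clone()
    to_mask = picked & (torch.rand(input_ids.shape) < 0.8)
    inputs[to_mask] = mask_token_id
    to_random = picked & ~to_mask & (torch.rand(input_ids.shape) < 0.5)
    inputs[to_random] = torch.randint(vocab_size, (int(to_random.sum()),))
    return inputs, labels

# Toy usage with BERT-base-like sizes (30522-token vocabulary, [MASK] id 103).
ids = torch.randint(1000, 30000, (2, 16))
masked_ids, labels = mask_for_mlm(ids, mask_token_id=103, vocab_size=30522)
```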
3.2 Knowledge Enhancement
ERNIE(Baidu): "ERNIE: Enhanced Representation through Knowledge Integration". arXiv(2019)
[PDF] [Code]
KnowBert: "Knowledge Enhanced Contextual Word Representations". EMNLP(2019)
[PDF]
ERNIE(Tsinghua): "ERNIE: Enhanced Language Representation with Informative Entities". ACL(2019)
[PDF] [Code]
COMET: "COMET: Commonsense Transformers for Automatic Knowledge Graph Construction". ACL(2019)
[PDF] [Code]
K-BERT: "K-BERT: Enabling Language Representation with Knowledge Graph". AAAI(2020)
[PDF] [Code]
WKLM: "Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model". ICLR(2020)
[PDF]
LUKE: "LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention". EMNLP(2020)
[PDF] [Code]
K-Adapter: "K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters". ICLR(2021)
[PDF]
KEPLER: "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation". TACL(2021)
[PDF] [Code]
RuleBERT: "RuleBERT: Teaching Soft Rules to Pre-Trained Language Models". EMNLP(2021)
[PDF] [Code]
BeliefBank: "BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief". EMNLP(2021)
[PDF] [Code]
Phrase-BERT: "Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration". EMNLP(2021)
[PDF] [Code]
"Syntax-Enhanced Pre-trained Model". ACL(2021)
[PDF] [Code]
StructFormer: "StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling". ACL(2021)
[PDF]
ERICA: "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning". ACL(2021)
[PDF] [Code]
"Structural Guidance for Transformer Language Models". ACL(2021)
[PDF] [Code]
HORNET: "HORNET: Enriching Pre-trained Language Representations with Heterogeneous Knowledge Sources". CIKM(2021)
[PDF]
"Drop Redundant, Shrink Irrelevant: Selective Knowledge Injection for Language Pretraining". IJCAI(2021)
[PDF]
3.3 Multilingual
XLM: "Cross-lingual Language Model Pretraining". arXiv(2019)
[PDF] [Code]
"Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond". TACL(2019)
[PDF] [Code]
UDify: "75 Languages, 1 Model: Parsing Universal Dependencies Universally". EMNLP(2019)
[PDF] [Code]
Unicoder: "Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks". EMNLP(2019)
[PDF]
XLM-R: "Unsupervised Cross-lingual Representation Learning at Scale". ACL(2020)
[PDF]
"Multilingual Alignment of Contextual Word Representations". ICLR(2020)
[PDF]
mBART: "Multilingual Denoising Pre-training for Neural Machine Translation". TACL(2020)
[PDF] [Code]
mT5: "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer". NAACL(2021)
[PDF] [Code]
InfoXLM: "InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training". NAACL(2021)
[PDF] [Code]
"Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training". EMNLP(2021)
[PDF] [Code]
ERNIE-M: "ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora". EMNLP(2021)
[PDF] [Code]
"A Simple Geometric Method for Cross-Lingual Linguistic Transformations with Pre-trained Autoencoders". EMNLP(2021)
[PDF]
"Boosting Cross-Lingual Transfer via Self-Learning with Uncertainty Estimation". EMNLP(2021)
[PDF]
"How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models". ACL(2021)
[PDF] [Code]
"Multilingual Pre-training with Universal Dependency Learning". NeurIPS(2021)
[PDF]
3.4 Multimodal
ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks". NeurIPS(2019)
[PDF]
LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". EMNLP(2019)
[PDF] [Code]
VideoBERT: "VideoBERT: A Joint Model for Video and Language Representation Learning". ICCV(2019)
[PDF]
VisualBERT: "VisualBERT: A Simple and Performant Baseline for Vision and Language". arXiv(2019)
[PDF]
B2T2: "Fusion of Detected Objects in Text for Visual Question Answering". EMNLP(2019)
[PDF] [Code]
VL-BERT: "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". ICLR(2020)
[PDF] [Code]
Unicoder-VL: "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". AAAI(2020)
[PDF]
VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA". AAAI(2020)
[PDF] [Code]
UNITER: "UNITER: UNiversal Image-TExt Representation Learning". ECCV(2020)
[PDF] [Code]
Oscar: "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks". ECCV(2020)
[PDF] [Code]
"12-in-1: Multi-Task Vision and Language Representation Learning". CVPR(2020)
[PDF] [Code]
ActBERT: "ActBERT: Learning Global-Local Video-Text Representations". CVPR(2020)
[PDF]
VLN: "Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks". CVPR(2020)
[PDF]
VILLA: "Large-Scale Adversarial Training for Vision-and-Language Representation Learning". arXiv(2020)
[PDF] [Code]
ImageBERT: "ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data". arXiv(2020)
[PDF]
ALIGN: "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". ICML(2021)
[PDF]
ClipBERT: "Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling". CVPR(2021)
[PDF] [Code]
DALL·E: "Zero-Shot Text-to-Image Generation". arXiv(2021)
[PDF] [Code]
CLIP: "Learning Transferable Visual Models From Natural Language Supervision". arXiv(2021)
[PDF] [Code]
IPT: "Pre-Trained Image Processing Transformer". CVPR(2021)
[PDF] [Code]
CvT: "CvT: Introducing Convolutions to Vision Transformers". ICCV(2021)
[PDF] [Code]
"Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". ICML(2021)
[PDF]
TERA: "TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech". TASLP(2021)
[PDF] [Code]
CaiT: "Going deeper with Image Transformers". ICCV(2021)
[PDF] [Code]
ViViT: "ViViT: A Video Vision Transformer". ICCV(2021)
[PDF] [Code]
VirTex: "VirTex: Learning Visual Representations From Textual Annotations". CVPR(2021)
[PDF] [Code]
M6: "M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining". KDD(2021)
[PDF]
"Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training". NeurIPS(2021)
[PDF]
GilBERT: "GilBERT: Generative Vision-Language Pre-Training for Modality-Incomplete Visual-Linguistic Tasks". SIGIR(2021)
[PDF]
3.5 Information Retrieval
ORQA: "Latent Retrieval for Weakly Supervised Open Domain Question Answering". ACL(2019)
[PDF]
REALM: "REALM: Retrieval-Augmented Language Model Pre-Training". arXiv(2020)
[PDF]
RAG: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". NeurIPS(2020)
[PDF] [Code]
DPR: "Dense Passage Retrieval for Open-Domain Question Answering". EMNLP(2020)
[PDF] [Code]
"Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering". EACL(2021)
[PDF] [Code]
3.6 Code
CodeT5: "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation". EMNLP(2021)
[PDF] [Code]
Codex: "Evaluating Large Language Models Trained on Code". arXiv(2021)
[PDF] [Code]
3.7 Others
ReasonBERT: "ReasonBERT: Pre-trained to Reason with Distant Supervision". EMNLP(2021)
[PDF] [Code]
"Sentence Bottleneck Autoencoders from Transformer Language Models". EMNLP(2021)
[PDF] [Code]
"Numeracy enhances the Literacy of Language Models". EMNLP(2021)
[PDF] [Code]
EnsLM: "EnsLM: Ensemble Language Model for Data Diversity by Semantic Clustering". ACL(2021)
[PDF] [Code]
"Reflective Decoding: Beyond Unidirectional Generation with Off-the-Shelf Language Models". ACL(2021)
[PDF] [Code]
BERTAC: "BERTAC: Enhancing Transformer-based Language Models with Adversarially Pretrained Convolutional Neural Networks". ACL(2021)
[PDF] [Code]
"Natural Language Understanding with Privacy-Preserving BERT". CIKM(2021)
[PDF]
BANG: "BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining". ICML(2021)
[PDF] [Code]
04
Analysis of PLMs
4.1 Knowledge
"What Does BERT Look at? An Analysis of BERT’s Attention". BlackBoxNLP(2019)
[PDF] [Code]
"BERT Rediscovers the Classical NLP Pipeline". ACL(2019)
[PDF]
"How Multilingual is Multilingual BERT?". ACL(2019)
[PDF]
"A Structural Probe for Finding Syntax in Word Representations". NAACL(2019)
[PDF] [Code]
"Language Models as Knowledge Bases?". EMNLP(2019)
[PDF] [Code]
"What Does BERT Learn about the Structure of Language?". ACL(2019)
[PDF] [Code]
"Linguistic Knowledge and Transferability of Contextual Representations". NAACL(2019)
[PDF]
"Assessing BERT's Syntactic Abilities". arXiv(2019)
[PDF] [Code]
"Probing Neural Network Comprehension of Natural Language Arguments" ACL(2019)
[PDF]
"How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings". EMNLP(2019)
[PDF]
"Visualizing and Measuring the Geometry of BERT". NeurIPS(2019)
[PDF]
"Designing and Interpreting Probes with Control Tasks". EMNLP(2019)
[PDF]
"Open Sesame: Getting inside BERT’s Linguistic Knowledge". BlackboxNLP(2019)
[PDF] [Code]
"What do you learn from context? Probing for sentence structure in contextualized word representations". ICLR(2019)
[PDF] [Code]
"Commonsense Knowledge Mining from Pretrained Models". EMNLP(2019)
[PDF]
"Do NLP Models Know Numbers? Probing Numeracy in Embeddings". EMNLP(2019)
[PDF]
"On the Cross-lingual Transferability of Monolingual Representations". ACL(2020)
[PDF]
"Cross-Lingual Ability of Multilingual BERT: An Empirical Study". ICLR(2020)
[PDF] [Code]
"What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models". TACL(2020)
[PDF] [Code]
"How Much Knowledge Can You Pack Into the Parameters of a Language Model?". EMNLP(2020)
[PDF] [Code]
"How Can We Know What Language Models Know?". TACL(2020)
[PDF] [Code]
"oLMpics-On What Language Model Pre-training Captures". TACL(2020)
[PDF] [Code]
"Information-Theoretic Probing with Minimum Description Length". EMNLP(2020)
[PDF] [Code]
"Inducing Relational Knowledge from BERT". AAAI(2020)
[PDF]
AutoPrompt: "AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts". EMNLP(2020)
[PDF] [Code]
"Emergent linguistic structure in artificial neural networks trained by self-supervision". PNAS(2020)
[PDF]
"Evaluating Commonsense in Pre-Trained Language Models". AAAI(2020)
[PDF] [Code]
"Inducing Relational Knowledge from BERT". AAAI(2020)
[PDF]
"Editing Factual Knowledge in Language Models". EMNLP(2021)
[PDF] [Code]
"How much pretraining data do language models need to learn syntax?". EMNLP(2021)
[PDF]
"Stepmothers are mean and academics are pretentious: What do pretrained language models learn about you?". EMNLP(2021)
[PDF] [Code]
"Putting Words in BERT's Mouth: Navigating Contextualized Vector Spaces with Pseudowords". EMNLP(2021)
[PDF] [Code]
"Frequency Effects on Syntactic Rule Learning in Transformers". EMNLP(2021)
[PDF] [Code]
"Exploring the Role of BERT Token Representations to Explain Sentence Probing Results". EMNLP(2021)
[PDF] [Code]
"How is BERT surprised? Layerwise detection of linguistic anomalies". ACL(2021)
[PDF] [Code]
"Implicit Representations of Meaning in Neural Language Model". ACL(2021)
[PDF] [Code]
"Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases". ACL(2021)
[PDF] [Code]
4.2 Robustness
"Universal Adversarial Triggers for Attacking and Analyzing NLP". EMNLP(2019)
[PDF] [Code]
"Pretrained Transformers Improve Out-of-Distribution Robustness". ACL(2020)
[PDF] [Code]
BERT-ATTACK: "BERT-ATTACK: Adversarial Attack Against BERT Using BERT". EMNLP(2020)
[PDF] [Code]
"Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment". AAAI(2020)
[PDF] [Code]
"The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers". EMNLP(2021)
[PDF] [Code]
"Sorting through the noe: Testing robustness of information processing in pre-trained language models". EMNLP(2021)
[PDF] [Code]
4.3 Sparsity
"Are Sixteen Heads Really Better than One?". NeurIPS(2019)
[PDF] [Code]
"Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned". ACL(2019)
[PDF] [Code]
"Revealing the Dark Secrets of BERT". EMNLP(2019)
[PDF]
"The Lottery Ticket Hypothesis for Pre-trained BERT Networks". NeurIPS(2020)
[PDF] [Code]
"When BERT Plays the Lottery, All Tickets Are Winning". EMNLP(2020)
[PDF] [Code]
4.4 Others
"Scaling Laws for Neural Language Models". arXiv(2020)
[PDF]
"Extracting Training Data from Large Language Models". arXiv(2020)
[PDF] [Code]
"On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜". FACCT(2021)
[PDF]
"Extracting Training Data from Large Language Models". USENIX(2021)
[PDF] [Code]
"Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little". EMNLP(2021)
[PDF] [Code]
"Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent". EMNLP(2021)
[PDF] [Code]
"Discretized Integrated Gradients for Explaining Language Models". EMNLP(2021)
[PDF] [Code]
"Do Long-Range Language Models Actually Use Long-Range Context?". EMNLP(2021)
[PDF]
"Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right". EMNLP(2021)
[PDF] [Code]
"Incorporating Residual and Normalization Layers into Analysis of Masked Language Models". EMNLP(2021)
[PDF] [Code]
"Sequence Length is a Domain: Length-based Overfitting in Transformer Models". EMNLP(2021)
[PDF]
"Are Pretrained Convolutions Better than Pretrained Transformers?". ACL(2021)
[PDF]
"Positional Artefacts Propagate Through Masked Language Model Embeddings". ACL(2021)
[PDF]
"When Do You Need Billions of Words of Pretraining Data?". ACL(2021)
[PDF] [Code]
"BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?". ACL(2021)
[PDF] [Code]
"Examining the Inductive Bias of Neural Language Models with Artificial Languages". ACL(2021)
[PDF] [Code]
"Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning". NeurIPS(2021)
[PDF]
05
Efficient PLMs
5.1 Model Training
RoBERTa: "RoBERTa: A Robustly Optimized BERT Pretraining Approach". arXiv(2019)
[PDF] [Code]
"Efficient Training of BERT by Progressively Stacking". ICML(2019)
[PDF] [Code]
Megatron-LM: "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism". arXiv(2019)
[PDF] [Code]
ELECTRA: "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators". ICLR(2020)
[PDF] [Code]
"Large Batch Optimization for Deep Learning: Training BERT in 76 minutes". ICLR(2020)
[PDF] [Code]
GShard: "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". arXiv(2020)
[PDF]
Admin: "Understanding the Difficulty of Training Transformers". EMNLP(2020)
[PDF] [Code]
ZeRO: "ZeRO: Memory optimizations Toward Training Trillion Parameter Models". SC(2020)
[PDF] [Code]
Switch Transformers: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity". arXiv(2021)
[PDF] [Code]
"How to Train BERT with an Academic Budget". EMNLP(2021)
[PDF]
"Optimizing Deeper Transformers on Small Datasets". ACL(2021)
[PDF] [Code]
"EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets". ACL(2021)
[PDF] [Code]
5.2 Model Inference
"BERT Loses Patience: Fast and Robust Inference with Early Exit". NeurIPS(2020)
[PDF] [Code]
GAML-BERT: "GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning". EMNLP(2021)
[PDF]
"Efficient Nearest Neighbor Language Models". EMNLP(2021)
[PDF] [Code]
GhostBERT: "GhostBERT: Generate More Features with Cheap Operations for BERT". ACL(2021)
[PDF] [Code]
LeeBERT: "LeeBERT: Learned Early Exit for BERT with cross-level optimization". ACL(2021)
[PDF]
"Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search". ACL(2021)
[PDF] [Code]
"Distilling Knowledge from BERT into Simple Fully Connected Neural Networks for Efficient Vertical Retrieval". CIKM(2021)
[PDF]
5.3 Model Compression
DistilBERT: "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". arXiv(2019)
[PDF] [Code]
PKD: "Patient Knowledge Distillation for BERT Model Compression". EMNLP(2019)
[PDF] [Code]
"Distilling Task-Specific Knowledge from BERT into Simple Neural Networks". arXiv(2019)
[PDF]
Q8BERT: "Q8BERT: Quantized 8Bit BERT". 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019
[PDF]
ALBERT: "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". ICLR(2020)
[PDF] [Code]
TinyBERT: "TinyBERT: Distilling BERT for Natural Language Understanding". EMNLP(2020)
[PDF] [Code]
Layerdrop: "Reducing Transformer Depth on Demand with Structured Dropout". ICLR(2020)
[PDF] [Code]
Q-BERT: "Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT". AAAI(2020)
[PDF]
MobileBERT: "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices". ACL(2020)
[PDF] [Code]
"Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning". 5th Workshop on Representation Learning for NLP(2020)
[PDF] [Code]
MiniLM: "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers". arXiv(2020)
[PDF] [Code]
FastBERT: "FastBERT: a Self-distilling BERT with Adaptive Inference Time". ACL(2020)
[PDF] [Code]
DeeBERT: "DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference". ACL(2020)
[PDF] [Code]
"Compressing Large-Scale Transformer-Based Models: A Case Study on BERT". TACL(2021)
[PDF]
"Winning the Lottery with Continuous Sparsification". NeurIPS(2020)
[PDF] [Code]
SqueezeBERT: "SqueezeBERT: What can computer vision teach NLP about efficient neural networks?". SustaiNLP(2020)
[PDF]
Audio ALBERT: "Audio Albert: A Lite Bert for Self-Supervised Learning of Audio Representation". SLT(2021)
[PDF] [Code]
T2R: "Finetuning Pretrained Transformers into RNNs". EMNLP(2021)
[PDF] [Code]
"Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression". EMNLP(2021)
[PDF] [Code]
Meta-KD: "Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains". ACL(2021)
[PDF] [Code]
"Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization". ACL(2021)
[PDF] [Code]
BinaryBERT: "BinaryBERT: Pushing the Limit of BERT Quantization". ACL(2021)
[PDF] [Code]
AutoTinyBERT: "AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models". ACL(2021)
[PDF] [Code]
"Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation". ACL(2021)
[PDF] [Code]
"Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators". ACL(2021)
[PDF] [Code]
NAS-BERT: "NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search". KDD(2021)
[PDF]
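Many of the compression papers above (DistilBERT, PKD, TinyBERT, MiniLM) build on knowledge distillation. The sketch below is a minimal, generic soft-label distillation loss in PyTorch, not the exact objective of any single paper; the temperature and mixing weight are illustrative defaults.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic soft-label distillation: KL divergence between temperature-softened
    teacher and student distributions, mixed with cross-entropy on gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```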
06
Use of PLMs
6.1 Two-Stage Fine-tuning
"Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks". arXiv(2018)
[PDF] [Code]
"How to Fine-Tune BERT for Text Classification?". CCL(2019)
[PDF]
"Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks". ACL(2020)
[PDF] [Code]
"Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?". ACL(2020)
[PDF]
"What to Pre-Train on? Efficient Intermediate Task Selection". EMNLP(2021)
[PDF] [Code]
"On the Influence of Masking Policies in Intermediate Pre-training". EMNLP(2021)
[PDF]
TADPOLE: "TADPOLE: Task ADapted Pre-Training via AnOmaLy DEtection". EMNLP(2021)
[PDF]
6.2 Multi-Task Fine-tuning
MT-DNN: "Multi-Task Deep Neural Networks for Natural Language Understanding". ACL(2019)
[PDF] [Code]
"BAM! Born-Again Multi-Task Networks for Natural Language Understanding". ACL(2019)
[PDF] [Code]
"Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding". arXiv(2019)
[PDF] [Code]
GradTS: "GradTS: A Gradient-Based Automatic Auxiliary Task Selection Method Based on Transformer Networks". EMNLP(2021)
[PDF]
"What's in Your Head? Emergent Behaviour in Multi-Task Transformer Models". EMNLP(2021)
[PDF]
MTAdam: "MTAdam: Automatic Balancing of Multiple Training Loss Terms". EMNLP(2021)
[PDF]
Muppet: "Muppet: Massive Multi-task Representations with Pre-Finetuning". EMNLP(2021)
[PDF]
"The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders". EMNLP(2021)
[PDF] [Code]
BERTGen: "BERTGen: Multi-task Generation through BERT". ACL(2021)
[PDF] [Code]
"Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks". ACL(2021)
[PDF] [Code]
6.3 Adapter
"BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning". ICML(2019)
[PDF] [Code]
Adapter: "Parameter-Efficient Transfer Learning for NLP". ICML(2019)
[PDF] [Code]
AdapterDrop: "AdapterDrop: On the Efficiency of Adapters in Transformers". EMNLP(2021)
[PDF]
"On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation". ACL(2021)
[PDF]
"Learning to Generate Task-Specific Adapters from Task Description". ACL(2021)
[PDF] [Code]
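For context on this subsection: the adapter approach of Houlsby et al. inserts small bottleneck modules into a frozen Transformer and trains only those, giving parameter-efficient task adaptation. Below is a minimal PyTorch sketch of such a bottleneck block; the hidden and bottleneck sizes are illustrative, and details such as layer normalization, initialization, and where exactly the block is inserted are omitted.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual
    connection. Inserted after a Transformer sub-layer while the backbone is frozen."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```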
6.4 Prompt
PET: "Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference". EACL(2021)
[PDF] [Code]
"It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners". NAACL(2021)
[PDF] [Code]
"Prefix-Tuning: Optimizing Continuous Prompts for Generation". arXiv(2021)
[PDF]
LM-BFF: "Making Pre-trained Language Models Better Few-shot Learners". ACL(2021)
[PDF] [Code]
"What Makes Good In-Context Examples for GPT-3?". arXiv(2021)
[PDF] [Code]
"The Power of Scale for Parameter-Efficient Prompt Tuning". EMNLP(2021)
[PDF] [Code]
"Finetuned Language Models Are Zero-Shot Learners". arXiv(2021)
[PDF]
"Calibrate Before Use: Improving Few-shot Performance of Language Models". ICML(2021)
[PDF] [Code]
TransPrompt: "TransPrompt: Towards an Automatic Transferable Prompting Framework for Few-shot Text Classification". EMNLP(2021)
[PDF] [Code]
SFLM: "Revisiting Self-training for Few-shot Learning of Language Model". EMNLP(2021)
[PDF] [Code]
ADAPET: "Improving and Simplifying Pattern Exploiting Training". EMNLP(2021)
[PDF] [Code]
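The papers in this subsection cast downstream tasks as cloze-style prediction with a pre-trained masked language model (as in PET and LM-BFF) or condition a frozen model on textual or continuous prompts. The sketch below illustrates the cloze idea with the Hugging Face fill-mask pipeline; the checkpoint, template, and label words ("great"/"terrible") are illustrative choices, not those of any specific paper.

```python
from transformers import pipeline

# Cloze-style prompting: wrap the input in a template containing a [MASK] slot
# and compare the model's scores for label words mapped to the classes.
fill = pipeline("fill-mask", model="bert-base-uncased")

review = "The movie was a complete waste of time."
template = f"{review} It was [MASK]."
scores = {r["token_str"]: r["score"]
          for r in fill(template, targets=["great", "terrible"])}
pred = "positive" if scores.get("great", 0) > scores.get("terrible", 0) else "negative"
print(scores, pred)
```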
6.5 Others
"To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks". RepL4NLP(2019)
[PDF]
"An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models". NAACL(2019)
[PDF] [Code]
"Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping". arXiv(2020)
[PDF]
SMART: "SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization". EMNLP(2020)
[PDF] [Code]
"Revisiting Few-sample BERT Fine-tuning". ICLR(2021)
[PDF]
Mirror-BERT: "Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders". EMNLP(2021)
[PDF] [Code]
"Pre-train or Annotate? Domain Adaptation with a Constrained Budget". EMNLP(2021)
[PDF] [Code]
AVocaDo: "AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain". EMNLP(2021)
[PDF]
CHILD-TUNING: "Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning". EMNLP(2021)
[PDF] [Code]
"Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation". ACL(2021)
[PDF] [Code]
LexFit: "LexFit: Lexical Fine-Tuning of Pretrained Language Models". ACL(2021)
[PDF] [Code]
"Selecting Informative Contexts Improves Language Model Fine-tuning". ACL(2021)
[PDF] [Code]
"An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models". ACL(2021)
[PDF] [Code]
"How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?". NeurIPS(2021)
[PDF] [Code]