Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for various NLP tasks by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers' performance declines are substantially smaller. Pretrained Transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.