While pretrained language models have exhibited impressive generalization capabilities, they still behave unpredictably under certain domain shifts. In particular, a model may learn a reasoning process on in-domain training data that does not hold for out-of-domain test data. We address the task of predicting out-of-domain (OOD) performance in a few-shot fashion: given a few target-domain examples and a set of models with similar training performance, can we understand how these models will perform on OOD test data? We first benchmark this task using model accuracy on the few-shot examples alone, then investigate how to incorporate analysis of the models' behavior using feature attributions to better tackle this problem. Specifically, we explore a set of "factors" designed to reveal model agreement with certain pathological heuristics that may indicate worse generalization capabilities. On textual entailment, paraphrase recognition, and a synthetic classification task, we show that attribution-based factors can help rank relative model OOD performance. However, accuracy on a few-shot test set is a surprisingly strong baseline, particularly when the system designer does not have in-depth prior knowledge about the domain shift.