To reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method for creating challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create 'Likelihood splits', in which examples that a pre-trained language model (LM) assigns lower likelihood are placed in the test set, while more likely examples go in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood splits are more challenging than random splits: compared with the corresponding random splits, relative error rates of state-of-the-art models increase by 59% for semantic parsing on Spider, 77% for natural language inference on SNLI, and 38% for yes/no question answering on BoolQ. Moreover, Likelihood splits create fairer benchmarks than adversarial filtering: when the LM used to create the splits is also used as the task model, our splits do not unduly penalize it.
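The core procedure can be sketched in a few lines. The following is a minimal sketch, assuming GPT-2 from Hugging Face Transformers as the scoring LM and a flat list of example strings; the helper names (`log_likelihood`, `likelihood_split`) and the 80/20 split ratio are illustrative assumptions, and the paper's actual scoring setup (model choice, conditioning, length normalization) may differ.

```python
# Sketch of a Likelihood split: score each example with a pre-trained LM,
# then send the least likely examples to the test set.
# Assumes GPT-2 as the scoring LM; the paper's exact setup may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def log_likelihood(text: str) -> float:
    """Total log-likelihood of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels == inputs, HF shifts targets internally and returns
    # the mean negative log-likelihood per predicted token.
    mean_nll = model(ids, labels=ids).loss.item()
    return -mean_nll * (ids.shape[1] - 1)  # scale back to a total

def likelihood_split(examples, test_fraction=0.2):
    """Train on the likely head of the distribution; test on the rare tail."""
    ranked = sorted(examples, key=log_likelihood, reverse=True)
    cut = int(len(ranked) * (1 - test_fraction))
    return ranked[:cut], ranked[cut:]  # (train, test)

train, test = likelihood_split([
    "The cat sat on the mat.",
    "She enjoys reading in the evening.",
    "Colorless green ideas sleep furiously.",
    "The quarterly report was filed on time.",
    "Thrice the brinded cat hath mewed.",
])
```

Because the split is defined purely by LM scores, the same recipe can be re-applied to any dataset whose examples can be rendered as text, which is what makes the approach easy to customize across tasks.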