Recent efforts to create challenge benchmarks that test the abilities of natural language understanding models have largely depended on human annotations. In this work, we introduce the "Break, Perturb, Build" (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs. BPB represents a question by decomposing it into the reasoning steps that are required to answer it, symbolically perturbs the decomposition, and then generates new question-answer pairs. We demonstrate the effectiveness of BPB by creating evaluation sets for three reading comprehension (RC) benchmarks, generating thousands of high-quality examples without human intervention. We evaluate a range of RC models on our evaluation sets, which reveals large performance gaps on generated examples compared to the original data. Moreover, symbolic perturbations enable fine-grained analysis of the strengths and limitations of models. Last, augmenting the training data with examples generated by BPB helps close performance gaps, without any drop on the original data distribution.
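To make the pipeline concrete, the following is a minimal, hedged sketch (not the authors' implementation) of the BPB loop on a toy example: a question is represented by its decomposition into reasoning steps, a symbolic perturbation is applied to the decomposition (here, a hypothetical rule that flips the superlative "highest" to "lowest"), and a new question-answer pair is built from the perturbed steps. The decomposition format, the perturbation rule, and the verbalization are simplified assumptions for illustration only.

```python
# Illustrative sketch of the BPB pipeline (decompose -> perturb -> build).
# NOT the authors' implementation: the decomposition format, the perturbation
# rule, and the answer computation below are simplified assumptions.

from dataclasses import dataclass


@dataclass
class Example:
    question: str
    decomposition: list[str]  # ordered reasoning steps
    answer: str


def perturb_superlative(steps: list[str]) -> list[str]:
    """Toy symbolic perturbation: swap 'highest' for 'lowest' in each step."""
    return [s.replace("highest", "lowest") for s in steps]


def build_question(steps: list[str]) -> str:
    """Naively verbalize the perturbed decomposition back into a question."""
    top = steps[-1].removeprefix("return ")
    for i, step in enumerate(steps[:-1], start=1):
        top = top.replace(f"#{i}", step.removeprefix("return "))
    return f"What is the {top}?"


def answer_from_context(context: dict) -> str:
    """Stand-in for recomputing the answer by executing the perturbed steps."""
    return context["lowest score"]


original = Example(
    question="What is the highest score in the game?",
    decomposition=["return scores in the game", "return highest of #1"],
    answer="21",
)
context = {"highest score": "21", "lowest score": "3"}

perturbed_steps = perturb_superlative(original.decomposition)
new_example = Example(
    question=build_question(perturbed_steps),
    decomposition=perturbed_steps,
    answer=answer_from_context(context),
)
print(new_example.question, "->", new_example.answer)
```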