Question answering (QA) models have shown compelling results on the task of Machine Reading Comprehension (MRC). Recently, these systems have been shown to outperform humans on held-out test sets of datasets such as SQuAD, but their robustness is not guaranteed. This brittleness is exposed by a performance drop when QA models are evaluated on adversarially generated examples. In this study, we explore the robustness of MRC models to entity renaming, using entities from low-resource regions such as Africa. We propose EntSwap, a test-time perturbation method that creates a test set in which the entities have been renamed. In particular, we rename entities of the types country, person, nationality, location, organization, and city, to create AfriSQuAD2. Using this perturbed test set, we evaluate the robustness of three popular MRC models. We find that large models perform comparatively well on novel entities relative to base models. Furthermore, our analysis indicates that the person entity type poses the greatest challenge to MRC model performance.
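To make the perturbation concrete, below is a minimal sketch of test-time entity renaming in the spirit of EntSwap. It is not the paper's implementation: it assumes spaCy's off-the-shelf NER for entity detection, and the replacement names and the `entswap` helper are illustrative placeholders, not the paper's actual substitution lists or pipeline.

```python
# Minimal entity-renaming perturbation sketch (assumptions noted above).
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical replacement pools keyed by spaCy entity labels, roughly
# covering the paper's six types (country/city -> GPE, location -> LOC,
# nationality -> NORP, person -> PERSON, organization -> ORG).
REPLACEMENTS = {
    "PERSON": ["Amara Okafor"],
    "GPE": ["Ouagadougou"],
    "NORP": ["Senegalese"],
    "LOC": ["Lake Turkana"],
    "ORG": ["Ecobank"],
}

def entswap(text: str) -> str:
    """Rename each detected entity, keeping all mentions of the same
    surface form consistent within the passage."""
    doc = nlp(text)
    mapping = {}            # original surface form -> replacement
    pieces, cursor = [], 0
    for ent in doc.ents:
        pool = REPLACEMENTS.get(ent.label_)
        if pool is None:
            continue        # leave entity types outside the six untouched
        if ent.text not in mapping:
            mapping[ent.text] = pool[len(mapping) % len(pool)]
        pieces.append(text[cursor:ent.start_char])
        pieces.append(mapping[ent.text])
        cursor = ent.end_char
    pieces.append(text[cursor:])
    return "".join(pieces)

print(entswap("Barack Obama visited Chicago with American diplomats."))
```

Applying the same renaming consistently to a passage, its question, and its answer spans would yield a perturbed QA example whose answerability is unchanged, which is what makes any resulting accuracy drop attributable to the novel entity names.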