Recent work on Open Domain Question Answering has shown a large discrepancy in model performance between novel test questions and those that largely overlap with training questions. However, it is unclear which aspects of novel questions make them challenging. Drawing upon studies on systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training set overlap, compositional generalization (comp-gen), and novel-entity generalization (novel-entity). When evaluating six popular parametric and non-parametric models, we find that for the established Natural Questions and TriviaQA datasets, even the strongest model performance for comp-gen/novel-entity is 13.1/5.4% and 9.6/1.5% lower than that for the full test set -- indicating the challenge posed by these types of questions. Furthermore, we show that whilst non-parametric models can handle questions containing novel entities relatively well, they struggle with those requiring compositional generalization. Lastly, we find that key question difficulty factors are: cascading errors from the retrieval component, frequency of question pattern, and frequency of the entity.