Recent work on Open-Domain Question Answering has shown a large discrepancy in model performance between novel test questions and those that largely overlap with training questions. However, it remains unclear which aspects of novel questions make them challenging. Drawing upon studies on systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training-set overlap, compositional generalization (comp-gen), and novel-entity generalization (novel-entity). Evaluating six popular parametric and non-parametric models, we find that on the established Natural Questions and TriviaQA datasets, even the strongest model's performance on comp-gen/novel-entity questions is 13.1/5.4% and 9.6/1.5% lower, respectively, than on the full test set -- indicating the challenge posed by these question types. Furthermore, we show that while non-parametric models can handle questions containing novel entities, they struggle with those requiring compositional generalization. Through thorough analysis, we find that the key question-difficulty factors are: cascading errors from the retrieval component, the frequency of the question pattern, and the frequency of the entity.