Recent pretrained language models have "solved" many reading comprehension benchmarks in which questions are written with access to the evidence document. However, datasets containing information-seeking queries, where queries are written independently of the evidence documents that are only paired with them afterwards, remain challenging. We analyze why answering information-seeking queries is harder and where their prevalent unanswerabilities arise, using Natural Questions and TyDi QA. Our controlled experiments suggest two headrooms for improvement: paragraph selection and answerability prediction, i.e., deciding whether the paired evidence document contains the answer to the query. When provided with a gold paragraph and knowing when to abstain from answering, existing models easily outperform a human annotator. However, predicting answerability itself remains challenging. We manually annotate 800 unanswerable examples across six languages, categorizing what makes them challenging to answer. With this new data, we conduct per-category answerability prediction, revealing issues in current dataset collection as well as task formulation. Together, our study points to avenues for future research in information-seeking question answering, both for dataset creation and model development.