Naturally occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable. Such questions are challenging because they require an answer strategy that deviates from typical answers to information-seeking questions. For instance, the question "When did Marie Curie discover Uranium?" cannot be answered as a typical "when" question without addressing the false assumption "Marie Curie discovered Uranium". In this work, we propose (QA)$^2$ (Question Answering with Questionable Assumptions), an open-domain evaluation dataset consisting of naturally occurring search engine queries that may or may not contain questionable assumptions. To be successful on (QA)$^2$, systems must be able to detect questionable assumptions and to produce adequate responses both for typical information-seeking questions and for ones with questionable assumptions. We find that current models struggle with handling questionable assumptions -- the best-performing model achieves 59% human rater acceptability on abstractive QA with (QA)$^2$ questions, leaving substantial headroom for progress.