Multihop reasoning remains an elusive goal as existing multihop benchmarks are known to be largely solvable via shortcuts. Can we create a question answering (QA) dataset that, by construction, \emph{requires} proper multihop reasoning? To this end, we introduce a bottom-up approach that systematically selects composable pairs of single-hop questions that are connected, i.e., where one reasoning step critically relies on information from another. This bottom-up methodology lets us explore a vast space of questions and add stringent filters as well as other mechanisms targeting connected reasoning. It provides fine-grained control over the construction process and the properties of the resulting $k$-hop questions. We use this methodology to create MuSiQue-Ans, a new multihop QA dataset with 25K 2-4 hop questions. Relative to existing datasets, MuSiQue-Ans is more difficult overall (3x increase in human-machine gap), and harder to cheat via disconnected reasoning (e.g., a single-hop model has a 30 point drop in F1). We further add unanswerable contrast questions to produce a more stringent dataset, MuSiQue-Full. We hope our datasets will help the NLP community develop models that perform genuine multihop reasoning.
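To make the notion of a composable, connected pair concrete, the following is a minimal sketch and not the actual MuSiQue construction pipeline. It assumes a simplified criterion: two single-hop questions compose when the answer of the first appears exactly once in the text of the second, the two answers differ, and the first question does not already contain the final answer. The names `SingleHop`, `is_composable`, `compose`, and `composable_pairs`, as well as the `#1` placeholder, are hypothetical and chosen for illustration only.

```python
import re
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class SingleHop:
    """A single-hop question with its answer (illustrative container)."""
    question: str
    answer: str


def is_composable(first: SingleHop, second: SingleHop) -> bool:
    """Assumed connectedness check; a simplification, not the paper's filter."""
    bridge = first.answer.lower()
    # The second question must mention the first answer exactly once.
    mentions_bridge = second.question.lower().count(bridge) == 1
    # The second hop must lead to a different answer than the first.
    distinct_answers = bridge != second.answer.lower()
    # The first question must not already contain the final answer.
    no_leak = second.answer.lower() not in first.question.lower()
    return mentions_bridge and distinct_answers and no_leak


def compose(first: SingleHop, second: SingleHop, placeholder: str = "#1") -> str:
    """Replace the bridge entity in the second question with a reference
    to the first hop's answer, so the second hop depends on the first."""
    pattern = re.compile(re.escape(first.answer), flags=re.IGNORECASE)
    return pattern.sub(lambda _: placeholder, second.question, count=1)


def composable_pairs(questions):
    """Enumerate ordered pairs (first, second) that pass the assumed check."""
    return [(a, b) for a, b in permutations(questions, 2) if is_composable(a, b)]


q1 = SingleHop("Who wrote Hamlet?", "William Shakespeare")
q2 = SingleHop("Where was William Shakespeare born?", "Stratford-upon-Avon")
q3 = SingleHop("What is the capital of France?", "Paris")

pairs = composable_pairs([q1, q2, q3])
two_hop = compose(q1, q2)
```

Under these assumptions only the pair `(q1, q2)` survives, and the second hop can no longer be read on its own because its bridge entity has been replaced by a reference to the first hop's answer. The substring matching used here is deliberately naive; a realistic pipeline would rely on entity-level matching and additional filters.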