Question answering for complex questions is often modeled as a graph construction or traversal task, where a solver must build or traverse a graph of facts that answer and explain a given question. This "multi-hop" inference has been shown to be extremely challenging, with few models able to aggregate more than two facts before being overwhelmed by "semantic drift", the tendency for long chains of facts to quickly drift off topic. This is a major barrier for current inference models, as even elementary science questions require an average of 4 to 6 facts to answer and explain. In this work we empirically characterize the difficulty of building or traversing a graph of sentences connected by lexical overlap. We evaluate the quality of chance sentence aggregation through 9,784 manually annotated judgments across knowledge graphs built from three free-text corpora (including study guides and Simple Wikipedia). We demonstrate that semantic drift tends to be high and aggregation quality low, at between 0.04% and 3%, and we highlight scenarios that maximize the likelihood of meaningfully combining information.
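To make the notion of a graph of sentences connected by lexical overlap concrete, the following is a minimal sketch, not the construction used in this work. The stopword list, the tokenizer, the function names `content_words` and `build_overlap_graph`, and the example sentences are all illustrative assumptions; in particular, the sketch matches exact word forms only and does no lemmatization.

```python
import re
from itertools import combinations

# Illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or"}


def content_words(sentence):
    """Lowercase the sentence, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}


def build_overlap_graph(sentences):
    """Map each sentence pair (i, j), i < j, to the content words they share.

    Pairs with no shared content word get no edge.
    """
    words = [content_words(s) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        shared = words[i] & words[j]
        if shared:
            edges[(i, j)] = shared
    return edges


# Hypothetical example sentences, written for this sketch.
sentences = [
    "A bird is a kind of animal.",
    "An animal requires food to survive.",
    "Food contains energy.",
    "The moon orbits the Earth.",
]
graph = build_overlap_graph(sentences)
```

In this toy graph, sentences 0 and 1 are linked by "animal" and sentences 1 and 2 by "food", so a two-hop chain starting from a fact about birds already ends at a fact about energy. This is a small-scale picture of how chains built on overlap alone can drift off topic.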