Multi-hop reasoning has been widely studied in recent years to make link prediction more interpretable. However, we find in experiments that many paths given by these models are actually unreasonable, while little work has been done on evaluating their interpretability. In this paper, we propose a unified framework to quantitatively evaluate the interpretability of multi-hop reasoning models so as to advance their development. Specifically, we define three evaluation metrics, including path recall, local interpretability, and global interpretability, and design an approximate strategy to calculate them using the interpretability scores of rules. Furthermore, we manually annotate all possible rules and establish a Benchmark to detect the Interpretability of Multi-hop Reasoning (BIMR). In experiments, we run nine baselines on our benchmark. The experimental results show that the interpretability of current multi-hop reasoning models is unsatisfactory and still far from the upper bound given by our benchmark. Moreover, rule-based models outperform multi-hop reasoning models in terms of both performance and interpretability, which points to a direction for future research, i.e., investigating how to better incorporate rule information into multi-hop reasoning models. Our code and datasets can be obtained from https://github.com/THU-KEG/BIMR.