In this work, we take a closer look at the evaluation of two families of methods for enriching information from knowledge graphs: Link Prediction and Entity Alignment. In the current experimental setting, multiple different scores are employed to assess different aspects of model performance. We analyze the informativeness of these evaluation measures and identify several shortcomings. In particular, we demonstrate that existing scores can hardly be used to compare results across different datasets. Moreover, we demonstrate that varying the size of the test set automatically affects the performance of the same model according to metrics commonly used for the Entity Alignment task. We show that this leads to various problems in the interpretation of results, which may support misleading conclusions. Therefore, we propose adjustments to the evaluation and demonstrate empirically how they support a fair, comparable, and interpretable assessment of model performance. Our code is available at https://github.com/mberr/rank-based-evaluation.
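To illustrate why raw rank-based scores are hard to compare across datasets, the following minimal sketch (an illustrative example of ours, not code from the accompanying repository; the function name and parameters are hypothetical) simulates a random-scoring baseline. Its expected mean rank is (n + 1) / 2 for n candidates, so the very same uninformed model obtains vastly different raw scores depending solely on the size of the candidate set, i.e. on the test split.

```python
import numpy as np


def mean_rank_of_random_baseline(num_candidates: int, num_queries: int = 10_000, seed: int = 0) -> float:
    """Simulate the mean rank achieved by a model that scores candidates at random.

    For each query, the rank of the true answer is uniform over 1..num_candidates,
    so the expected mean rank is (num_candidates + 1) / 2.
    """
    rng = np.random.default_rng(seed)
    ranks = rng.integers(low=1, high=num_candidates + 1, size=num_queries)
    return float(ranks.mean())


if __name__ == "__main__":
    # The same (uninformed) model looks very different depending on how many
    # candidates each query has, i.e. on the dataset / test set size.
    for n in (100, 1_000, 10_000):
        print(f"{n:>6} candidates -> mean rank ~ {mean_rank_of_random_baseline(n):.1f}")
```

Running this prints mean ranks near 50.5, 500.5, and 5000.5, respectively, which makes the cross-dataset comparability problem concrete.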