Intrinsic evaluations of OIE systems are carried out either manually -- with human evaluators judging the correctness of extractions -- or automatically, on standardized benchmarks. The latter, while much more cost-effective, is less reliable, primarily because of the incompleteness of the existing OIE benchmarks: the ground truth extractions do not include all acceptable variants of the same fact, leading to unreliable assessment of models' performance. Moreover, the existing OIE benchmarks are available for English only. In this work, we introduce BenchIE: a benchmark and evaluation framework for comprehensive evaluation of OIE systems for English, Chinese and German. In contrast to existing OIE benchmarks, BenchIE takes into account informational equivalence of extractions: our gold standard consists of fact synsets, clusters in which we exhaustively list all surface forms of the same fact. We benchmark several state-of-the-art OIE systems using BenchIE and demonstrate that these systems are significantly less effective than indicated by existing OIE benchmarks. We make BenchIE (data and evaluation code) publicly available.
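The fact-synset idea described above can be sketched in a few lines. This is a hedged illustration, not the actual BenchIE scoring code: it assumes gold facts are stored as sets of (subject, relation, object) triples, counts an extraction as correct iff it exactly matches any surface form in some gold synset, and computes fact-level recall over synsets rather than over individual triples. The function name `score_synsets` and the triple representation are illustrative assumptions.

```python
# Hedged sketch (NOT the official BenchIE implementation): synset-based
# scoring. A gold "fact synset" is a set of acceptable surface forms
# (subject, relation, object) of one and the same fact. An extracted
# triple is a true positive iff it matches any form in any synset;
# a synset counts as recalled iff at least one of its forms was extracted.

def score_synsets(extractions, fact_synsets):
    """Return (precision, recall, f1) for a list of extracted triples
    against a list of gold fact synsets (each a set of triples)."""
    matched_synsets = set()  # indices of gold facts covered so far
    true_positives = 0
    for triple in extractions:
        for i, synset in enumerate(fact_synsets):
            if triple in synset:
                matched_synsets.add(i)
                true_positives += 1
                break  # one extraction matches at most one gold fact
    precision = true_positives / len(extractions) if extractions else 0.0
    recall = len(matched_synsets) / len(fact_synsets) if fact_synsets else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this scheme a system is not penalized for choosing one acceptable phrasing of a fact over another, which is exactly the incompleteness problem the abstract attributes to earlier benchmarks.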