In evaluation campaigns, participants often explore variations of popular, state-of-the-art baselines as a low-risk strategy to achieve competitive results. While effective, this can lead to local "hill climbing" rather than more radical and innovative departures from standard methods. Moreover, if many participants build on similar baselines, the overall diversity of approaches considered may be limited. In this work, we propose a new class of IR evaluation metrics intended to promote greater diversity of approaches in evaluation campaigns. Whereas traditional IR metrics focus on user experience, our two "innovation" metrics instead reward exploration of more divergent, higher-risk strategies that find relevant documents missed by other systems. Experiments on four TREC collections show that our metrics do change system rankings by rewarding systems that find such rare, relevant documents. This result is further supported by a controlled synthetic-data experiment and a qualitative analysis. In addition, we show that our metrics achieve higher evaluation stability and discriminative power than the standard metrics we modify. To support reproducibility, we share our source code.
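To make the core idea concrete, the sketch below shows one way a rarity-weighted relevance score could be computed, where a relevant document counts for more if fewer systems retrieved it. This is a minimal illustrative assumption, not the paper's exact formulation of the two innovation metrics; the function name `rarity_weighted_precision` and the toy data are hypothetical.

```python
# Hypothetical sketch (an assumption, not the paper's exact metric):
# relevant documents retrieved by fewer systems contribute more to a score.
from collections import Counter


def rarity_weighted_precision(system_runs, qrels, system_id, k=10):
    """Score one system's top-k run, up-weighting relevant documents
    that few other systems retrieved.

    system_runs: dict mapping system id -> ranked list of doc ids
    qrels: set of relevant doc ids for the topic
    system_id: the system being evaluated
    k: evaluation depth
    """
    # Count how many systems retrieve each relevant document in their top k.
    retrieval_counts = Counter(
        doc
        for run in system_runs.values()
        for doc in run[:k]
        if doc in qrels
    )

    score = 0.0
    for doc in system_runs[system_id][:k]:
        if doc in qrels:
            # Weight is highest when only this system found the document.
            score += 1.0 / retrieval_counts[doc]
    return score / k


# Minimal usage example with toy data.
runs = {
    "sysA": ["d1", "d2", "d3"],
    "sysB": ["d1", "d4", "d5"],
    "sysC": ["d1", "d6", "d7"],
}
relevant = {"d1", "d4"}
print(rarity_weighted_precision(runs, relevant, "sysB", k=3))
```

Under this toy example, `sysB` receives full credit for `d4`, which no other system found, but only a fractional credit for `d1`, which every system retrieved; normalizing by `k` is a design choice of this sketch rather than something specified in the abstract.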