Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboards such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field. However, the goal is not simply to identify which run is "best", achieving the top score. The goal is to move the field forward by developing new robust techniques that work in many different settings and are adopted in research and practice. This paper uses the MS MARCO and TREC Deep Learning Track as our case study, comparing it to the case of TREC ad hoc ranking in the 1990s. We show how the design of the evaluation effort can encourage or discourage certain outcomes, and we raise questions about the internal and external validity of results. We provide some analysis of certain pitfalls, and a statement of best practices for avoiding such pitfalls. We summarize the progress of the effort so far, and describe our desired end state of "robust usefulness", along with the steps that might be required to get us there.