We argue that current IR metrics, modeled on optimizing user experience, measure too narrow a portion of the IR space. If IR systems are weak, these metrics undersample or completely filter out the deeper documents that need improvement. If IR systems are relatively strong, these metrics undersample deeper relevant documents that could underpin even stronger IR systems, ones that could present content from tens or hundreds of relevant documents in a user-digestible hierarchy or text summary. We reanalyze over 70 TREC tracks from the past 28 years, showing that roughly half undersample top-ranked documents and nearly all undersample tail documents. We show that in the 2020 Deep Learning tracks, neural systems were near-optimal on top-ranked documents but achieved only modest gains over BM25 on tail documents. Our analysis is based on a simple new systems-oriented metric, 'atomized search length', which is capable of accurately and evenly measuring all relevant documents at any depth.