From Baseline to Top Performer: A Reproducibility Study of Approaches at the TREC 2021 Conversational Assistance Track

This paper reports on an effort to reproduce the organizers' baseline as well as the top-performing participant submission at the 2021 edition of the TREC Conversational Assistance Track. TREC systems are commonly regarded as reference points for effectiveness comparison. Yet the papers accompanying them are subject to less strict requirements than peer-reviewed publications, which can make reproducibility challenging. Our results indicate that key practical information is indeed missing. While the results can be reproduced within a 19% relative margin with respect to the main evaluation measure, the relative difference between the baseline and the top-performing approach shrinks from the reported 18% to 5%. Additionally, we report on a new set of experiments aimed at understanding the impact of various pipeline components. We show that end-to-end system performance can indeed benefit from advanced retrieval techniques in either stage of a two-stage retrieval pipeline. We also measure the impact of the dataset used for fine-tuning the query rewriter and find that employing different query rewriting methods in different stages of the retrieval pipeline might be beneficial. Moreover, these results are shown to generalize across the 2020 and 2021 editions of the track. We conclude our study with a list of lessons learned and practical suggestions.
