Although representational retrieval models based on Transformers have made major advances in the past few years, and despite widely accepted conventions and best practices for testing such models, a $\textit{standardized}$ evaluation framework for them has not yet been developed. In this work, we formalize the best practices and conventions followed by researchers in the literature, paving the way for more standardized evaluations and therefore fairer comparisons between models. Our framework (1) embeds the documents and queries; (2) for each query-document pair, computes the relevance score as the dot product of the query and document embeddings; (3) uses the $\texttt{dev}$ set of the MSMARCO dataset to evaluate the models; and (4) uses the $\texttt{trec\_eval}$ script to calculate MRR@100, the primary metric used to evaluate the models. Most importantly, we demonstrate the use of this framework by evaluating some of the most well-known dense retrieval models.
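As a rough illustration of steps (2) and (4), the sketch below scores query-document pairs by dot product and computes an MRR@k over the resulting rankings. This is not the framework's released code: the function names, the use of NumPy, and the assumption of one relevant document per query are illustrative; in the actual framework the $\texttt{trec\_eval}$ script performs the metric computation.

```python
import numpy as np

def score(query_embeddings: np.ndarray, doc_embeddings: np.ndarray) -> np.ndarray:
    """Relevance scores as the dot product of query and document embeddings.

    query_embeddings: (num_queries, dim); doc_embeddings: (num_docs, dim).
    Returns a (num_queries, num_docs) score matrix.
    """
    return query_embeddings @ doc_embeddings.T

def mrr_at_k(scores: np.ndarray, relevant_doc_ids: list, k: int = 100) -> float:
    """Mean Reciprocal Rank at cutoff k, assuming one relevant document per query."""
    reciprocal_ranks = []
    for q_idx, rel_id in enumerate(relevant_doc_ids):
        top_k = np.argsort(-scores[q_idx])[:k]  # indices of the k highest-scoring documents
        hits = np.where(top_k == rel_id)[0]
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))

# Toy usage: random embeddings stand in for a dense retrieval model's output.
rng = np.random.default_rng(0)
q_emb = rng.normal(size=(4, 768))
d_emb = rng.normal(size=(1000, 768))
print(mrr_at_k(score(q_emb, d_emb), relevant_doc_ids=[3, 17, 256, 999], k=100))
```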