A Gold Standard Dataset for the Reviewer Assignment Problem

Many peer-review venues are either using or looking to use algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score" (a numerical estimate of the expertise of a reviewer in reviewing a paper), and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose an algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better ones is the lack of publicly available gold-standard data needed to perform reproducible research. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously. We use this data to compare several popular algorithms employed in computer science conferences and offer recommendations for stakeholders. Our main findings are as follows. First, all algorithms make a non-trivial amount of error. For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases, highlighting the vital need for more research on the similarity-computation problem. Second, most existing algorithms are designed to work with titles and abstracts of papers, and in this regime the Specter+MFR algorithm performs best. Third, to improve performance, it may be important to develop modern deep-learning based algorithms that can make use of the full texts of papers: the classical TF-IDF algorithm enhanced with full texts of papers is on par with the deep-learning based Specter+MFR, which cannot make use of this information.
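To make the pairwise ordering task concrete, the following is a minimal Python sketch of how an ordering error rate could be computed for a single reviewer. It is an illustration under stated assumptions, not the evaluation procedure used with the dataset: the function name `pairwise_ordering_error`, the dictionary layout, the choice to skip pairs with equal self-reported expertise, and the choice to count tied predictions as half an error are all hypothetical.

```python
from itertools import combinations


def pairwise_ordering_error(expertise, predicted):
    """Fraction of paper pairs that the predicted scores order incorrectly.

    expertise: dict mapping paper id -> self-reported expertise (ground truth)
    predicted: dict mapping paper id -> similarity score from some algorithm

    Assumptions of this sketch (not taken from the source):
      * pairs with equal self-reported expertise are skipped;
      * a tie in the predicted scores counts as half an error.
    """
    errors = 0.0
    total = 0
    for a, b in combinations(expertise, 2):
        if expertise[a] == expertise[b]:
            continue  # no ground-truth ordering for this pair
        total += 1
        truth_prefers_a = expertise[a] > expertise[b]
        if predicted[a] == predicted[b]:
            errors += 0.5  # the algorithm cannot distinguish the two papers
        elif (predicted[a] > predicted[b]) != truth_prefers_a:
            errors += 1.0  # the algorithm ranks the pair the wrong way round
    return errors / total if total else 0.0


# Toy data for one hypothetical reviewer; the values are made up.
reviewer_expertise = {"p1": 5.0, "p2": 3.0, "p3": 1.5}
predicted_similarity = {"p1": 0.9, "p2": 0.4, "p3": 0.6}

error = pairwise_ordering_error(reviewer_expertise, predicted_similarity)
print(f"pairwise ordering error: {error:.2f}")
```

In this toy example the predicted scores misorder one of the three pairs (p2 versus p3). Under this reading, pairs whose self-reported expertise values are far apart might correspond to the "easy" cases and pairs with close values to the "hard" cases, although that is an interpretation rather than something the text above specifies.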
