On Testing for Biases in Peer Review

We consider the issue of biases in scholarly research, specifically in peer review. There is a long-standing debate on whether exposing author identities to reviewers induces biases against certain groups, and our focus is on designing tests to detect the presence of such biases. Our starting point is a remarkable recent work by Tomkins, Zhang and Heavlin, which conducted a controlled, large-scale experiment to investigate the existence of biases in the peer reviewing of the WSDM conference. We present two sets of results in this paper. The first set of results is negative and pertains to the statistical tests and the experimental setup used in the work of Tomkins et al. We show that the test employed therein does not guarantee control over the false alarm probability. When relevant variables are correlated and any of the following conditions holds, it can, with high probability, declare the presence of a bias when none exists: (a) measurement error, (b) model mismatch, (c) reviewer calibration. Moreover, we show that the setup of their experiment may itself inflate the false alarm probability if (d) bidding is performed in a non-blind manner or (e) a popular reviewer assignment procedure is employed. Our second set of results is positive and is built around a novel approach to testing for biases that we propose. We present a general framework for testing for biases in (single- vs. double-blind) peer review. We then design hypothesis tests that, under minimal assumptions, guarantee control over the false alarm probability and non-trivial power even under conditions (a)--(c), and we propose an alternative experimental setup that mitigates issues (d) and (e). Finally, we show that no statistical test can improve over the non-parametric tests we consider in terms of the assumptions required to control the false alarm probability.
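
To make condition (a) concrete, the following is a minimal simulation sketch of how measurement error combined with correlated variables can inflate the false alarm probability of a regression-based test. It is not the test of Tomkins et al. and not the tests proposed in the paper. It uses a simplified linear-regression stand-in, and every name and parameter in it (`false_alarm_rate`, `noise_sd`, the Gaussian score model, the 1.96 threshold) is an assumption made for illustration only.

```python
import numpy as np

def false_alarm_rate(noise_sd, n_papers=1000, n_trials=200, seed=0):
    """Fraction of simulated experiments in which a bias is declared
    even though the simulated reviewers are unbiased by construction.

    Hypothetical model (illustrative assumption, not from the paper):
      q : latent paper quality
      w : binary author property, correlated with q
      x : noisy estimate of q used as the quality covariate
      y : score given with author identity visible; depends on q only
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        q = rng.normal(size=n_papers)
        w = (q + rng.normal(size=n_papers) > 0).astype(float)
        x = q + noise_sd * rng.normal(size=n_papers)   # measurement error
        y = q + 0.5 * rng.normal(size=n_papers)        # no dependence on w

        # Ordinary least squares of y on [1, x, w].
        X = np.column_stack([np.ones(n_papers), x, w])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n_papers - X.shape[1])
        cov = sigma2 * np.linalg.inv(X.T @ X)

        # Two-sided test of "coefficient on w is zero" at roughly the 5% level.
        t_w = beta[2] / np.sqrt(cov[2, 2])
        rejections += int(abs(t_w) > 1.96)
    return rejections / n_trials

if __name__ == "__main__":
    print("no measurement error:  ", false_alarm_rate(noise_sd=0.0))
    print("with measurement error:", false_alarm_rate(noise_sd=1.0))
```

In this toy model the noisy covariate `x` only partially accounts for quality, so the author property `w` absorbs the remaining quality signal and the test attributes it to bias. With `noise_sd=0.0` the same test rejects at close to its nominal rate.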
