Racial disparity in academia is a widely acknowledged problem, and a quantitative understanding of race-based systemic inequalities is an important step towards a more equitable research system. However, few large-scale analyses have been performed on this topic, mostly because of the lack of robust race-disambiguation algorithms. Author metadata generally does not include the author's race. Therefore, an algorithm must be employed that uses known information about authors, namely their names, to infer their perceived race. Nevertheless, as with any other algorithm, the process of racial inference can generate biases if it is not carefully considered. When the research is focused on understanding race-based inequalities, such biases undermine the objectives of the investigation and may perpetuate inequities. The goal of this article is to assess the biases introduced by the different approaches used in name-based racial inference. We use information from the US census and mortgage applications to infer the race of US author names in the Web of Science. We estimate the effects of using given and family names, thresholds or continuous distributions, and imputation. Our results demonstrate that the validity of name-based inference varies by race and ethnicity and that threshold approaches underestimate Black authors and overestimate White authors. We conclude with recommendations to avoid potential biases. This article fills an important research gap that will allow more systematic and unbiased studies on racial disparity in science.
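The contrast between threshold and continuous (fractional) approaches can be illustrated with a minimal sketch. The name labels, group labels, probabilities, and the 0.9 threshold below are all hypothetical values invented for illustration; they are not taken from the census or mortgage data, and the functions are not the authors' implementation.

```python
from collections import Counter

# Hypothetical P(group | family name) values, invented for illustration only.
# They are NOT drawn from the US census or mortgage application data.
NAME_PROBS = {
    "name_a": {"White": 0.95, "Black": 0.03, "Other": 0.02},
    "name_b": {"White": 0.55, "Black": 0.40, "Other": 0.05},
    "name_c": {"White": 0.30, "Black": 0.65, "Other": 0.05},
}


def threshold_counts(names, threshold=0.9):
    """Assign each author to the most likely group only if its
    probability reaches the threshold; otherwise leave the author unassigned."""
    counts = Counter()
    for name in names:
        group, prob = max(NAME_PROBS[name].items(), key=lambda kv: kv[1])
        if prob >= threshold:
            counts[group] += 1
        else:
            counts["unassigned"] += 1
    return counts


def fractional_counts(names):
    """Spread each author across all groups in proportion to the
    name's probability distribution (continuous approach)."""
    counts = Counter()
    for name in names:
        for group, prob in NAME_PROBS[name].items():
            counts[group] += prob
    return counts


authors = ["name_a", "name_b", "name_c"]
print("threshold:", dict(threshold_counts(authors)))
print("fractional:", dict(fractional_counts(authors)))
```

With these toy inputs, the threshold rule assigns one author to White, none to Black, and leaves two unassigned, while the fractional rule retains roughly 1.08 expected Black authors. This only illustrates the mechanism by which a threshold can drop groups whose names are rarely highly predictive; the size of the effect in real data depends on the actual name distributions.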