Prediction of an individual's race and ethnicity plays an important role in social science and public health research. Examples include studies of racial disparity in health and voting. Recently, Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to combine information from Census surname files with the geocoding of an individual's residence, has emerged as a leading methodology for this prediction task. Unfortunately, BISG suffers from two Census data problems that contribute to unsatisfactory predictive performance for minorities. First, the decennial Census often contains zero counts for minority racial groups in the Census blocks where some members of those groups reside. Second, because the Census surname files only include frequent names, many surnames, especially those of minorities, are missing from the list. To address the zero counts problem, we introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts by extending the naive Bayesian inference of the BISG methodology to full posterior inference. To address the missing surname problem, we supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available. Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians. The proposed methodology, together with additional name data, is available via the open-source software package wru.
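
To make the Bayes' rule step and the zero counts problem concrete, the following is a minimal sketch of the standard (naive) BISG update, not of fBISG and not of the wru package's API. The function name `bisg_posterior` and all probabilities are hypothetical and chosen only for illustration; they are not taken from Census data. The sketch assumes the usual BISG factorization, in which the posterior over race is proportional to P(race | surname) times P(location | race).

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Naive BISG update (illustrative sketch, hypothetical helper).

    Posterior P(race | surname, geo) is taken to be proportional to
    P(race | surname) * P(geo | race), then normalized over races.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("every race has zero posterior mass")
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical surname-based probabilities for a surname that is mostly Asian.
p_race_given_surname = {"white": 0.05, "black": 0.05, "hispanic": 0.10, "asian": 0.80}

# Hypothetical share of each group's population living in one Census block.
# The Asian count in this block is zero, so the share is exactly 0.
p_geo_given_race = {"white": 0.002, "black": 0.001, "hispanic": 0.001, "asian": 0.0}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)
for race, prob in posterior.items():
    print(f"{race}: {prob:.3f}")
```

In this sketch, the zero block count forces the Asian posterior to exactly zero, however strongly the surname points the other way. This is the failure mode that the abstract attributes to zero counts; treating the counts as measured with error, as fBISG is described as doing, is what avoids such a hard zero.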