Prediction of an individual's race and ethnicity plays an important role in social science and public health research. Examples include studies of racial disparities in health and voting. Recently, Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to combine information from Census surname files with the geocoding of an individual's residence, has emerged as a leading methodology for this prediction task. Unfortunately, BISG suffers from two Census data problems that contribute to unsatisfactory predictive performance for minorities. First, the decennial Census often contains zero counts for minority racial groups in the Census blocks where some members of those groups reside. Second, because the Census surname files only include frequently occurring names, many surnames, especially those of minorities, are missing from the list. To address the zero-count problem, we introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts by extending the naive Bayesian inference of the BISG methodology to full posterior inference. To address the missing-surname problem, we supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available. Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians. The proposed methodology, together with additional name data, is available via the open-source software WRU.
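To make the Bayes' rule step and the zero-count problem concrete, the following is a minimal sketch rather than the WRU implementation. It assumes the usual BISG simplification that residence location is independent of surname given race, so the posterior is proportional to P(race | surname) × P(location | race). The function name `bisg_posterior`, the race labels, and all probabilities are hypothetical values chosen for illustration only.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Naive BISG update (illustrative sketch, not the WRU code).

    Computes P(race | surname, geo), which is proportional to
    P(race | surname) * P(geo | race), assuming geo and surname
    are independent given race.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race.get(race, 0.0)
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("every race category has zero probability")
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical surname table entry: a surname held mostly by Asian individuals.
p_race_given_surname = {"white": 0.05, "black": 0.02, "hispanic": 0.03, "asian": 0.90}

# Hypothetical Census block shares: the block records a zero count for Asians,
# so P(geo | race = asian) is exactly 0 under the naive approach.
p_geo_given_race = {"white": 2e-6, "black": 1e-6, "hispanic": 1e-6, "asian": 0.0}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)
for race, prob in posterior.items():
    print(f"{race}: {prob:.3f}")
```

In this made-up example, the zero block count forces the Asian posterior to exactly zero even though the surname alone strongly suggests otherwise. This is the failure mode that treating Census counts as measured with error is meant to avoid; the sketch does not implement fBISG itself.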