Bayesian Improved Surname Geocoding (BISG) is the most popular method for proxying race/ethnicity in voter registration files that do not contain it. This paper benchmarks BISG against a range of previously untested machine learning alternatives, using voter files with self-reported race/ethnicity from California, Florida, North Carolina, and Georgia. This analysis yields three key findings. First, machine learning consistently outperforms BISG at individual classification of race/ethnicity. Second, BISG and machine learning methods exhibit divergent biases for estimating regional racial composition. Third, the performance of all methods varies substantially across states. These results suggest that pre-trained machine learning models are preferable to BISG for individual classification. Furthermore, mixed results across states underscore the need for researchers to empirically validate their chosen race/ethnicity proxy in their populations of interest.