We provide the largest publicly available dictionaries of first, middle, and last names compiled for imputing race and ethnicity with methods such as Bayesian Improved Surname Geocoding (BISG). The dictionaries are built from the voter files of six Southern states that collect self-reported racial data at voter registration. Our data cover far more names than any comparable dataset: roughly one million first names, 1.1 million middle names, and 1.4 million surnames. Individuals are categorized into five mutually exclusive racial and ethnic groups (White, Black, Hispanic, Asian, and Other), and each dictionary reports, for every name, the number of individuals in each group. These counts can be normalized row-wise or column-wise to obtain the conditional probability of race given name or of name given race, respectively. The resulting probabilities can then be used for imputation in any data-analytic task for which ground-truth racial and ethnic data are not available.
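As a minimal sketch of the normalization step, assume each dictionary is a table with one row per name and one count column per group. The column labels, the helper function names, and the toy counts below are hypothetical and chosen only for illustration; they are not taken from the released files.

```python
from typing import Dict

# Hypothetical column labels for the five groups; the released files may differ.
RACES = ("white", "black", "hispanic", "asian", "other")

Counts = Dict[str, Dict[str, int]]
Probs = Dict[str, Dict[str, float]]

# Toy counts invented for illustration; they are NOT values from the dictionaries.
toy_counts: Counts = {
    "SMITH": {"white": 70, "black": 25, "hispanic": 2, "asian": 1, "other": 2},
    "GARCIA": {"white": 5, "black": 1, "hispanic": 90, "asian": 2, "other": 2},
}


def race_given_name(counts: Counts) -> Probs:
    """Row-wise normalization: divide each count by its name's total."""
    probs: Probs = {}
    for name, row in counts.items():
        total = sum(row[r] for r in RACES)
        probs[name] = {r: (row[r] / total if total else 0.0) for r in RACES}
    return probs


def name_given_race(counts: Counts) -> Probs:
    """Column-wise normalization: divide each count by its group's total."""
    totals = {r: sum(row[r] for row in counts.values()) for r in RACES}
    return {
        name: {r: (row[r] / totals[r] if totals[r] else 0.0) for r in RACES}
        for name, row in counts.items()
    }


p_race_given_name = race_given_name(toy_counts)
p_name_given_race = name_given_race(toy_counts)
```

Within each name, the row-normalized values sum to one across the five groups; within each group, the column-normalized values sum to one across all names in the table. A method such as BISG would combine probabilities of this kind with geographic information, a step this sketch does not cover.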