We examine whether some countries are more richly represented in embedding space than others. We find that countries whose names occur with low frequency in training corpora are more likely to be tokenized into subwords, are less semantically distinct in embedding space, and are less likely to be correctly predicted. For example, Ghana (the correct answer, and in-vocabulary) is not predicted for the prompt "The country producing the most cocoa is [MASK]." Although these performance discrepancies and representational harms stem from frequency, we find that frequency is highly correlated with a country's GDP, so the resulting disparities perpetuate historic power and wealth inequalities. We analyze the effectiveness of mitigation strategies, recommend that researchers report training word frequencies, and recommend future work for the community to define and design representational guarantees.