Regional Negative Bias in Word Embeddings Predicts Racial Animus--but only via Name Frequency

The word embedding association test (WEAT) is an important method for measuring linguistic biases against social groups such as ethnic minorities in large text corpora. It does so by comparing the semantic relatedness of words prototypical of the groups (e.g., names unique to those groups) and attribute words (e.g., 'pleasant' and 'unpleasant' words). We show that anti-Black WEAT estimates from geo-tagged social media data at the level of metropolitan statistical areas strongly correlate with several measures of racial animus--even when controlling for sociodemographic covariates. However, we also show that every one of these correlations is explained by a third variable: the frequency of Black names in the underlying corpora relative to White names. This occurs because word embeddings tend to group positive (negative) words and frequent (rare) words together in the estimated semantic space. As the frequency of Black names on social media is strongly correlated with Black Americans' prevalence in the population, this results in spurious anti-Black WEAT estimates wherever few Black Americans live. This suggests that research using the WEAT to measure bias should consider term frequency, and also demonstrates the potential consequences of using black-box models like word embeddings to study human cognition and behavior.
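The comparison described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the commonly used WEAT effect-size formulation (the difference in mean association between two target sets, divided by the standard deviation of association over all target words); the exact variant used in the paper is not stated here. The function names are hypothetical, the sample standard deviation (`ddof=1`) is an assumption, and the vectors are random toy data rather than real embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Mean similarity of w to attribute set A minus mean similarity to set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Standardized difference in association between target sets X and Y.

    X, Y: target word vectors (e.g., names typical of two groups).
    A, B: attribute word vectors (e.g., 'pleasant' and 'unpleasant' words).
    """
    s_X = np.array([association(x, A, B) for x in X])
    s_Y = np.array([association(y, A, B) for y in Y])
    pooled = np.concatenate([s_X, s_Y])
    # Assumption: sample standard deviation over all target words.
    return float((s_X.mean() - s_Y.mean()) / pooled.std(ddof=1))

# Toy data (assumption): two random directions stand in for the
# 'pleasant' and 'unpleasant' regions of an embedding space.
rng = np.random.default_rng(0)
dim = 50
pleasant_dir = rng.normal(size=dim)
unpleasant_dir = rng.normal(size=dim)

def near(direction, n, noise=0.5):
    """Return n noisy copies of a direction vector."""
    return direction + noise * rng.normal(size=(n, dim))

A = near(pleasant_dir, 8)    # attribute set 1
B = near(unpleasant_dir, 8)  # attribute set 2
X = near(pleasant_dir, 8)    # targets placed near attribute set 1
Y = near(unpleasant_dir, 8)  # targets placed near attribute set 2

effect = weat_effect_size(X, Y, A, B)
print(f"WEAT effect size: {effect:.2f}")
```

A positive value means the `X` targets sit closer to the `A` attributes than the `Y` targets do; swapping `X` and `Y` flips the sign. The sketch does not model term frequency, which the abstract identifies as the confound.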


Related content

Word vector representation

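A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Methods for learning such embeddings can be trained directly on large unlabeled corpora. The quality of the embeddings also depends heavily on the choice of context window size: larger windows typically yield embeddings that reflect topical information, whereas smaller windows yield embeddings that reflect a word's function and contextual semantics.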
