Social bias in language - towards genders, ethnicities, ages, and other social groups - poses an ethically relevant problem for many NLP applications. Recent research has shown that machine learning models trained on such data may not only adopt but even amplify the bias. So far, however, little attention has been paid to bias in computational argumentation. In this paper, we study the existence of social biases in large English debate portals. In particular, we train word embedding models on portal-specific corpora and systematically evaluate their bias using WEAT, an existing metric for measuring bias in word embeddings. In a word co-occurrence analysis, we then investigate the causes of bias. The results suggest that all tested debate corpora contain unbalanced and biased data, mostly in favor of males with European-American names. Our empirical insights contribute to an understanding of bias in argumentative data sources.
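As a rough illustration of the WEAT measure referred to above, the following is a minimal sketch of the WEAT effect size computation in Python with numpy. The word lists, random vectors, and function names below are purely illustrative assumptions for this sketch; they are not the target/attribute sets or the portal-specific embeddings used in the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, vec):
    """s(w, A, B): mean similarity of w to attribute set A minus mean similarity to B."""
    return (np.mean([cosine(vec[w], vec[a]) for a in A])
            - np.mean([cosine(vec[w], vec[b]) for b in B]))

def weat_effect_size(X, Y, A, B, vec):
    """WEAT effect size: difference of mean associations of the two target sets
    X and Y, normalized by the standard deviation over all target words."""
    assoc_X = [association(x, A, B, vec) for x in X]
    assoc_Y = [association(y, A, B, vec) for y in Y]
    return (np.mean(assoc_X) - np.mean(assoc_Y)) / np.std(assoc_X + assoc_Y)

# Toy usage with made-up word lists and random vectors; in the paper's setting,
# vec would hold embeddings trained on one of the debate portal corpora.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    words = ["he", "she", "john", "amy", "career", "family", "office", "home"]
    vec = {w: rng.normal(size=50) for w in words}
    X, Y = ["he", "john"], ["she", "amy"]             # target word sets
    A, B = ["career", "office"], ["family", "home"]   # attribute word sets
    print(weat_effect_size(X, Y, A, B, vec))
```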