This paper presents a new method for automatically detecting words with lexical gender in large-scale language datasets. Currently, the evaluation of gender bias in natural language processing relies on manually compiled lexicons of gendered expressions, such as pronouns ('he', 'she', etc.) and nouns with lexical gender ('mother', 'boyfriend', 'policewoman', etc.). However, manually compiled lists become static if they are not periodically updated, and compiling them often involves value judgments by individual annotators and researchers. Moreover, terms not included in a list fall outside the scope of analysis. To address these issues, we devised a scalable, dictionary-based method to automatically detect lexical gender that can provide a dynamic, up-to-date analysis with high coverage. Our approach reaches over 80% accuracy in determining lexical gender, both for nouns retrieved randomly from a Wikipedia sample and for a list of gendered words used in previous research.
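To make the idea of dictionary-based detection concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that a word's lexical gender can be read off gendered cue words in its dictionary definition. The cue lists, the toy dictionary, and the function name `lexical_gender` are all hypothetical and chosen only for illustration; the abstract does not specify which dictionary or which cues are used.

```python
import re

# Hypothetical cue words (an assumption for this sketch; the abstract
# does not list the cues the actual method relies on).
MASCULINE_CUES = {"man", "male", "boy", "he", "his", "him", "father", "son"}
FEMININE_CUES = {"woman", "female", "girl", "she", "her", "hers", "mother", "daughter"}

# Toy stand-in for a real dictionary resource (assumed, not from the source).
TOY_DICTIONARY = {
    "mother": "a female parent",
    "boyfriend": "a man or boy that somebody has a romantic relationship with",
    "policewoman": "a female police officer",
    "teacher": "a person whose job is teaching",
}


def lexical_gender(word, dictionary=TOY_DICTIONARY):
    """Label a word by counting gendered cue words in its definition."""
    definition = dictionary.get(word.lower())
    if definition is None:
        # The word is not covered by the dictionary at all.
        return "unknown"
    # Lowercase the definition and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", definition.lower())
    masculine = sum(token in MASCULINE_CUES for token in tokens)
    feminine = sum(token in FEMININE_CUES for token in tokens)
    if masculine > feminine:
        return "masculine"
    if feminine > masculine:
        return "feminine"
    # No cues, or an equal number of cues on both sides.
    return "neutral"
```

Because the labels come from definitions rather than from a fixed word list, swapping the toy dictionary for a regularly updated dictionary resource would extend coverage without manual re-annotation, which is the property the abstract emphasizes.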