Gender Bias in Text: Labeled Datasets and Lexicons

Language has a profound impact on our thoughts, perceptions, and conceptions of gender roles. Gender-inclusive language is therefore a key tool for promoting social inclusion and advancing gender equality. Consequently, detecting and mitigating gender bias in text is instrumental in halting its propagation and limiting its societal consequences. However, there is a lack of gender bias datasets and lexicons for automating the detection of gender bias with supervised and unsupervised machine learning (ML) and natural language processing (NLP) techniques. The main contribution of this work is therefore to publicly provide labeled datasets and exhaustive lexicons, built by collecting, annotating, and augmenting relevant sentences, to facilitate the detection of gender bias in English text. To this end, we present an updated version of our previously proposed taxonomy: we re-formalize its structure, add a new bias type, and map each bias subtype to an appropriate detection methodology. The released datasets and lexicons span multiple bias subtypes, including Generic He, Generic She, Explicit Marking of Sex, and Gendered Neologisms. We also used word embedding models to further augment the collected lexicons.
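
As a rough illustration of embedding-based lexicon augmentation, the sketch below expands a seed lexicon with every vocabulary word whose cosine similarity to a seed term reaches a threshold. It is not the procedure used in this work, which the abstract does not specify. The vectors in `TOY_EMBEDDINGS`, the function name `augment_lexicon`, and the threshold of 0.95 are all assumptions made for the example; a real run would load a pretrained embedding model instead.

```python
import math

# Toy 3-dimensional vectors invented purely for illustration (assumption);
# a real pipeline would load pretrained word embeddings instead.
TOY_EMBEDDINGS = {
    "chairman": [0.9, 0.1, 0.0],
    "spokesman": [0.8, 0.2, 0.1],
    "policeman": [0.85, 0.15, 0.05],
    "table": [0.0, 0.1, 0.9],
    "river": [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def augment_lexicon(seeds, embeddings, threshold=0.95):
    """Return the seeds plus every word close to at least one seed.

    `threshold` is a hypothetical cutoff; in practice it would be tuned,
    and candidates would be reviewed by annotators before being accepted.
    """
    known_seeds = [s for s in seeds if s in embeddings]
    expanded = set(seeds)
    for word, vec in embeddings.items():
        if word in expanded:
            continue
        if any(cosine(vec, embeddings[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


if __name__ == "__main__":
    print(sorted(augment_lexicon({"chairman"}, TOY_EMBEDDINGS)))
```

With the toy vectors above, the seed "chairman" pulls in "spokesman" and "policeman" while unrelated words such as "table" stay out. Candidates found this way would still need manual review, since embedding neighbors are not guaranteed to carry the same bias subtype as the seed.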
