As natural language processing (NLP) research on gender bias becomes a significant interdisciplinary topic, prevalent data-driven techniques such as large-scale language models suffer from inadequate data and biased corpora, especially for languages with insufficient resources such as Chinese. To this end, we propose a Chinese cOrpus foR Gender bIas Probing and Mitigation (CORGI-PM), which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the Chinese context. Moreover, we address three challenges for automatic textual gender bias mitigation, which require models to detect, classify, and mitigate textual gender bias. We also conduct experiments with state-of-the-art language models to provide baselines. To the best of our knowledge, CORGI-PM is the first sentence-level Chinese corpus for gender bias probing and mitigation.