In recent years, word and sentence embeddings have become established as standard text preprocessing for a wide range of NLP tasks and have significantly improved performance on these tasks. Unfortunately, it has also been shown that these embeddings inherit various kinds of biases from the training data and thereby pass on biases present in society to NLP solutions. Many papers have attempted to quantify bias in word or sentence embeddings in order to evaluate debiasing methods or to compare different embedding models, often using cosine-based scores. However, some works have raised doubts about these scores, showing that even when such scores report low bias, biases persist and can be revealed by other tests. In fact, a great variety of bias scores and tests has been proposed in the literature, without any consensus on an optimal solution, and there is a lack of work studying the behavior of bias scores and elaborating their advantages and disadvantages. In this work, we explore different cosine-based bias scores. We provide a bias definition based on ideas from the literature and derive novel requirements for bias scores. Furthermore, we thoroughly investigate the existing cosine-based scores and their limitations in order to show why these scores fail to report biases in some situations. Finally, we propose SAME, a new bias score that addresses the shortcomings of existing bias scores, and show empirically that SAME is better suited to quantify bias in word embeddings.
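To make "cosine-based bias score" concrete, below is a minimal sketch of one widely used instance: the mean-cosine association underlying WEAT (Caliskan et al., 2017), which measures whether a word vector lies closer to one attribute set than another. This is an illustrative example, not the SAME score proposed in this work; the random vectors merely stand in for real embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Mean-cosine association of word vector w with attribute sets A and B,
    as in WEAT: positive values mean w is closer to A than to B."""
    return (np.mean([cosine(w, a) for a in A])
            - np.mean([cosine(w, b) for b in B]))

# Illustrative usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
w = rng.normal(size=300)                      # e.g., embedding of "engineer"
A = [rng.normal(size=300) for _ in range(5)]  # e.g., male attribute words
B = [rng.normal(size=300) for _ in range(5)]  # e.g., female attribute words
print(association(w, A, B))
```

A score of this form reports zero when the mean similarities to both attribute sets cancel out, which, as discussed above, does not guarantee the absence of bias.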