Recent work has shown that distributional word vector spaces often encode human biases like sexism or racism. In this work, we conduct an extensive analysis of biases in Arabic word embeddings by applying a range of recently introduced bias tests to a variety of embedding spaces induced from corpora in Arabic. We measure the presence of biases across several dimensions, namely: embedding models (Skip-Gram, CBOW, and FastText) and vector sizes, types of text (encyclopedic text and news vs. user-generated content), dialects (Egyptian Arabic vs. Modern Standard Arabic), and time (diachronic analyses over corpora from different time periods). Our analysis yields several interesting findings, e.g., that implicit gender bias in embeddings trained on Arabic news corpora steadily increases over time (between 2007 and 2017). We make the Arabic bias specifications (AraWEAT) publicly available.
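As context for the bias tests referred to above: the name AraWEAT points to the Word Embedding Association Test (WEAT; Caliskan et al., 2017). Below is a minimal sketch of the WEAT effect size computed over generic NumPy word vectors; the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cos(u, v):
    # Cosine similarity between two word vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean similarity of word w to attribute set A
    # minus its mean similarity to attribute set B.
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # WEAT effect size d: normalized difference between the mean
    # associations of the two target sets X and Y (e.g., male vs.
    # female terms) with the attribute sets A and B (e.g., career
    # vs. family terms). All arguments are lists of vectors.
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y)
```

A positive effect size indicates that the target set X is more strongly associated with attribute set A than the target set Y is; values near zero indicate no measured association.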