Language representations are efficient tools used across NLP applications, but they are rife with encoded societal biases. These biases have been studied extensively, but primarily for English language representations and for biases common in Western society. In this work, we investigate biases present in Hindi language representations, focusing on caste- and religion-associated biases. We demonstrate how biases are unique to specific language representations, shaped by the history and culture of the regions where the languages are widely spoken, and how the same societal bias (such as binary gender-associated bias) is encoded by different words and text spans across languages. Our findings highlight the need to account for cultural awareness and linguistic artifacts when modeling language representations, in order to better understand the encoded biases.